Tag archive: Scientific methodology

The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Wired)


Chris Anderson, Science, 06.23.2008 12:00 PM


Illustration: Marian Bantjes

“All models are wrong, but some are useful.”

So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us. Until now. Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all.

Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.

The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later. For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required. That’s why Google can translate languages without actually “knowing” them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: “All models are wrong, and increasingly you can succeed without them.”

This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.

The big target here isn’t advertising, though. It’s science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Now biology is heading in the same direction. The models we were taught in school about “dominant” and “recessive” genes steering a strictly Mendelian process have turned out to be an even greater simplification of reality than Newton’s laws. The discovery of gene-protein interactions and other aspects of epigenetics has challenged the view of DNA as destiny and even introduced evidence that environment can influence inheritable traits, something once considered a genetic impossibility.

In short, the more we learn about biology, the further we find ourselves from a model that can explain it.

There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

If the words “discover a new species” call to mind Darwin and drawings of finches, you may be stuck in the old way of doing science. Venter can tell you almost nothing about the species he found. He doesn’t know what they look like, how they live, or much of anything else about their morphology. He doesn’t even have their entire genome. All he has is a statistical blip — a unique sequence that, being unlike any other sequence in the database, must represent a new species.

This sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page. It’s just data. By analyzing it with Google-quality computing resources, though, Venter has advanced biology more than anyone else of his generation.

This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM’s Tivoli and open source versions of Google File System and MapReduce.1 Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.

Learning to use a “computer” of this scale may be challenging. But the opportunity is great: The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

There’s no reason to cling to our old ways. It’s time to ask: What can science learn from Google?

Chris Anderson (canderson@wired.com) is the editor in chief of Wired.

Related: The Petabyte Age: Sensors everywhere. Infinite storage. Clouds of processors. Our ability to capture, warehouse, and understand massive amounts of data is changing science, medicine, business, and technology. As our collection of facts and figures grows, so will the opportunity to find answers to fundamental questions. Because in the era of big data, more isn’t just more. More is different.

Correction:
1 This story originally stated that the cluster software would include the actual Google File System.
06.27.08

Against storytelling of scientific results (Nature Methods)

Yarden Katz

Nature Methods 10, 1045 (2013) doi:10.1038/nmeth.2699 – Published online 30 October 2013

To the Editor:

Krzywinski and Cairo1 beautifully illustrate the widespread view that scientific writing should follow a journalistic ‘storytelling’, wherein the choice of what data to plot, and how, is tailored to the message the authors want to deliver. However, they do not discuss the pitfalls of the approach, which often result in a distorted and unrepresentative display of data—one that does not do justice to experimental complexities and their myriad of interpretations.

If we project the features of great storytellers onto a scientist, the result is a portrait of a scientist far from ideal. Great storytellers embellish and conceal information to evoke a response in their audience. Inconvenient truths are swept away, and marginalities are spun to make a point more spectacular. A storyteller would plot the data in the way most persuasive rather than most informative or representative.

Storytelling encourages the unrealistic view that scientific projects fit a singular narrative. Biological systems are difficult to measure and control, so nearly all experiments afford multiple interpretations—but storytelling actively denies this fact of science.

The ‘story-told’ scientific paper is a constrictive mapping between figures and text. Figures produced by masters of scientific storytelling are so tightly controlled to match the narrative that the reader is left with little to ponder or interpret. Critical reading of such papers becomes a detective’s game, in which one reads between the lines for clues of data relegated to a supplement for their deviance from ‘the story’.

Dissecting the structure of scientific papers, Bruno Latour explains the utility of the storytelling approach in giving readers the sense that they are evaluating the data along with the authors while simultaneously persuading them of the story. The storytelling way to achieve this is “to lay out the text so that wherever the reader is there is only one way to go”2—or as Krzywinski and Cairo put it, “Inviting readers to draw their own conclusions is risky”1. Authors prevent this by “carefully stacking more black boxes, less easily disputable arguments”2. This is consistent with the visualization advice that Krzywinski and Cairo give: the narrower and more processed the display of the data is to fit the story, the more black boxes are stacked, making it harder for the reader to access data raw enough to support alternative models or ‘stories’.

Readers and authors know that complex experiments afford multiple interpretations, and so such deviances from the singular narrative must be present somewhere. It would be better for both authors and readers if these could be discussed openly rather than obfuscated. For those who plan to follow up on the results, these discrepancies are often the most important. Storytelling therefore impedes communication of critical information by restricting the scope of the data to that agreeable with the story.

Problems arise when experiments are driven within a storytelling framework. In break rooms of biology research labs, one often hears: “It’d be a great story if X regulated Y by novel mechanism Z.” Experiments might be prioritized by asking, “Is it important for your story?” Storytelling poses a dizzying circularity: before your findings are established, you should decide whether these are the findings you would like to reach. Expectations of a story-like narrative can also be demoralizing to scientists, as most experimental data do not easily fold into this framing.

Finally, a great story in the journalistic sense is a complete one. Papers that make the unexplained observations transparent get penalized in the storytelling framework as incomplete. This prevents the communal puzzle-solving that arises by piecing together unexplained observations from multiple papers.

The alternative to storytelling is the usual language of evidence and arguments that are used—with varying degrees of certainty—to support models and theories. Speaking of models and their evidence goes back to the oldest of scientific discourse, and this framing is also standard in philosophy and law. This language allows authors to discuss evidence for alternative models without imposing a singular journalistic-like story.

There might be other roles for storytelling. Steven McKnight’s lab recently found, entirely unexpectedly, that a small molecule can be used to purify a complex of RNA-binding proteins in the cell, revealing a wide array of striking biological features3. It is that kind of story of discovery—what François Jacob called “night science”—that is often best suited for storytelling, though these narratives are often deemed by scientists as irrelevant ‘fluff’.

As practiced, storytelling shares more with journalism than with science. Journalists seek a great story, and the accompanying pressures sometimes lead to distortion in the portrayal of events in the press. When exerted on scientists, these pressures can yield similar results. Storytelling encourages scientists to design experiments according to what constitutes a ‘great story’, potentially closing off unforeseen avenues more exciting than any story imagined a priori. For the alternative framing to be adopted, editors, reviewers and authors (particularly at the higher-profile journals) will have to adjust their evaluation criteria and reward authors who choose representative displays while discussing alternative models to their own.

References

  1. Krzywinski, M. & Cairo, A. Nat. Methods 10, 687 (2013).
  2. Latour, B. Science in Action (Harvard Univ. Press, 1987).
  3. Baker, M. Nat. Methods 9, 639 (2012).

Building Cyberinfrastructure Capacity for the Social Sciences (American Anthropological Association)

Posted on October 9, 2013 by Joslyn O.

Today’s guest blog post is by Dr. Emilio Moran. Dr. Moran is Distinguished Professor Emeritus, Indiana University and Visiting Hannah Distinguished Professor, Michigan State University.

The United States and the world are changing rapidly. These new conditions challenge the ability of the social, behavioral and economic sciences to understand what is happening at a national scale and in people’s daily local lives. Globalization, the shifting composition of the economy, and the revolution in information brought about by the internet and social media are just a few of the forces that are changing Americans’ lives. Not only has the world changed since the data collection methods currently in use were developed, but the ways now available to link information and new data sources have radically changed. Expert panels have called for increasing the cyber-infrastructure capability of the social, behavioral, and economic (SBE) sciences so that our tools and research infrastructure keep pace with these changing social and informational landscapes. A series of workshops has met over the past three years to address these challenges. The participants now invite your feedback on the proposal below, and you are invited to attend a Special Event at this year’s AAA meeting in Chicago on Saturday, November 23, 2013, from 12:15 to 1:30 pm in the Chicago Hilton Boulevard C room.

Needed is a new national framework, or platform, for social, behavioral and economic research that is both scalable and flexible; that permits new questions to be addressed; that allows for rapid response and adaptation to local shocks (such as extreme weather events or natural resource windfalls); and that facilitates understanding local manifestations of national phenomena such as economic downturns.  To advance a national data collection and analysis infrastructure, the approach we propose —  building a network of social observatories — is a way to have a sensitive instrument to measure how local communities respond to a range of natural and social conditions over time.  This new scientific infrastructure will enable the SBE sciences to contribute to societal needs at multiple levels and will facilitate collaboration with other sciences in addressing questions of critical importance.

Our vision is that of a network of observatories designed from the ground up, each observatory representing an area of the United States.  From a small number of pilot projects the network would develop (through a national sampling frame and protocol) into a representative sample of the places where people live and the people who live there. Each observatory would be an entity, whether physical or virtual, that is charged with collecting, curating, and disseminating data from people, places, and institutions in the United States.  These observatories must provide a basis for inference from what happens in local places to a national context and ensure a robust theoretical foundation for social analysis.  This is the rationale for recommending that this network of observatories be built on a population-based sample capable of addressing the needs of the nation’s diverse people but located in the specific places and communities where they live and work.  Unlike most other existing research platforms, this population and place-based capability will ensure that we understand not only the high-density urban and suburban places where the majority of the population lives, but also the medium- and low-density exurban and rural places that represent a vast majority of the land area in the nation.

To accomplish these objectives, we propose to embed in these regionally-based observatories a nationally representative population-based sample that would enable the observatory data to be aggregated in such a way as to produce a national picture of the United States on an ongoing basis.  The tentative plan would be to select approximately 400 census tracts to represent the U.S. population while also fully capturing the diversity that characterizes local places. The individuals, institutions and communities in which these census tracts are embedded will be systematically studied over time and space by observatories spread across the country. During the formative stages the number of census tracts and the number of observatories that might be needed, given the scope of the charge that is currently envisioned, will be determined.

These observatories will study the social, behavioral and economic experiences of the population in their physical and environmental context at fine detail. The observatories are intended to stimulate the development of new directions and modes of inquiry. They will do so through the use of diverse complementary methods and data sources, including ethnography, experiments, administrative data, social media, biomarkers, and financial and public health records. These observatories will work closely with local and state governments to gain access to administrative records that provide extensive data on the population in those tracts (i.e., 2 million people), thereby providing a depth of understanding and integration of knowledge that is less invasive and less subject to declining response rates than survey-derived data.

To attain the vision proposed here we need the commitment and enthusiasm of the community to meet these challenges and the resolve to make this proposed network of observatories useful to the social sciences and society. For more details on our objectives and reports from previous meetings, visit http://socialobservatories.org/. Please contribute your ideas at the site so that the proposal can benefit from your input, and come to Chicago for the Special Event on Saturday, November 23, 2013. We are particularly interested in hearing how this platform could help you in your future research. This is an opportunity for anthropological strengths in ethnography and local research to contribute their insights in a way that will make a difference for local people and for the nation.

Emilio F. Moran, co-Chair of the SOCN
Distinguished Professor Emeritus, Indiana University and
Visiting Hannah Distinguished Professor, Michigan State University

The Social Sciences’ ‘Physics Envy’ (N.Y.Times)

OPINION – GRAY MATTER

Jessica Hagy

By KEVIN A. CLARKE AND DAVID M. PRIMO

Published: April 01, 2012

HOW scientific are the social sciences?

Economists, political scientists and sociologists have long suffered from an academic inferiority complex: physics envy. They often feel that their disciplines should be on a par with the “real” sciences and self-consciously model their work on them, using language (“theory,” “experiment,” “law”) evocative of physics and chemistry.

This might seem like a worthy aspiration. Many social scientists contend that science has a method, and if you want to be scientific, you should adopt it. The method requires you to devise a theoretical model, deduce a testable hypothesis from the model and then test the hypothesis against the world. If the hypothesis is confirmed, the theoretical model holds; if the hypothesis is not confirmed, the theoretical model does not hold. If your discipline does not operate by this method – known as hypothetico-deductivism – then in the minds of many, it’s not scientific.

Such reasoning dominates the social sciences today. Over the last decade, the National Science Foundation has spent many millions of dollars supporting an initiative called Empirical Implications of Theoretical Models, which espouses the importance of hypothetico-deductivism in political science research. For a time, The American Journal of Political Science explicitly refused to review theoretical models that weren’t tested. In some of our own published work, we have invoked the language of model testing, yielding to the pressure of this way of thinking.

But we believe that this way of thinking is badly mistaken and detrimental to social research. For the sake of everyone who stands to gain from a better knowledge of politics, economics and society, the social sciences need to overcome their inferiority complex, reject hypothetico-deductivism and embrace the fact that they are mature disciplines with no need to emulate other sciences.

The ideal of hypothetico-deductivism is flawed for many reasons. For one thing, it’s not even a good description of how the “hard” sciences work. It’s a high school textbook version of science, with everything messy and chaotic about scientific inquiry safely ignored.

A more important criticism is that theoretical models can be of great value even if they are never supported by empirical testing. In the 1950s, for instance, the economist Anthony Downs offered an elegant explanation for why rival political parties might adopt identical platforms during an election campaign. His model relied on the same strategic logic that explains why two competing gas stations or fast-food restaurants locate across the street from each other – if you don’t move to a central location but your opponent does, your opponent will nab those voters (customers). The best move is for competitors to mimic each other.
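The strategic logic described above can be made concrete with a small simulation. What follows is only an illustrative sketch, not Mr. Downs's own formalization: it assumes voters sit at integer positions on a single left-right line, that each voter backs the nearest candidate, and that ties are split evenly. The function names (vote_share, best_response, simulate) and the starting positions are hypothetical, chosen for this example.

```python
from statistics import median


def vote_share(a, b, voters):
    """Share of voters strictly closer to position a than to b (ties count half)."""
    score = 0.0
    for v in voters:
        dist_a, dist_b = abs(v - a), abs(v - b)
        if dist_a < dist_b:
            score += 1.0
        elif dist_a == dist_b:
            score += 0.5
    return score / len(voters)


def best_response(opponent, voters, grid):
    """Grid position with the highest vote share against a fixed opponent."""
    return max(grid, key=lambda p: vote_share(p, opponent, voters))


def simulate(voters, a, b, grid, rounds=50):
    """Let the two candidates take turns repositioning until neither moves."""
    for _ in range(rounds):
        new_a = best_response(b, voters, grid)
        new_b = best_response(new_a, voters, grid)
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return a, b


# Assumed setup: one voter at each integer position from 0 to 100.
voters = list(range(101))
grid = range(101)
a, b = simulate(voters, a=10, b=90, grid=grid)
print(a, b, median(voters))  # prints "50 50 50"
```

In this toy setup each candidate keeps edging just inside the rival's position to capture more of the line, and the process stops only when both stand on the median voter. That is the "identical positions" prediction the model is known for.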

This framework has proven useful to generations of political scientists even though Mr. Downs did not empirically test it and despite the fact that its main prediction, that candidates will take identical positions in elections, is clearly false. The model offered insight into why candidates move toward the center in competitive elections, and it proved easily adaptable to studying other aspects of candidate strategies. But Mr. Downs would have had a hard time publishing this model today.

Or consider the famous “impossibility theorem,” developed by the economist Kenneth Arrow, which shows that no single voting system can simultaneously satisfy several important principles of fairness. There is no need to test this model with data – in fact, there is no way to test it – and yet the result offers policy makers a powerful lesson: there are unavoidable trade-offs in the design of voting systems.

To borrow a metaphor from the philosopher of science Ronald Giere, theories are like maps: the test of a map lies not in arbitrarily checking random points but in whether people find it useful to get somewhere.

Likewise, the analysis of empirical data can be valuable even in the absence of a grand theoretical model. Did the welfare reform championed by Bill Clinton in the 1990s reduce poverty? Are teenage employees adversely affected by increases in the minimum wage? Do voter identification laws disproportionately reduce turnout among the poor and minorities? Answering such questions about the effects of public policies does not require sweeping theoretical claims, just careful attention to the data.

Unfortunately, the belief that every theory must have its empirical support (and vice versa) now constrains the kinds of social science projects that are undertaken, alters the trajectory of academic careers and drives graduate training. Rather than attempt to imitate the hard sciences, social scientists would be better off doing what they do best: thinking deeply about what prompts human beings to behave the way they do.

Kevin A. Clarke and David M. Primo, associate professors of political science at the University of Rochester, are the authors of “A Model Discipline: Political Science and the Logic of Representations.”