Use of social media to monitor and predict outbreaks and public opinion on health topics

University of Iowa
Iowa Research Online
Theses and Dissertations
2014

Use of social media to monitor and predict outbreaks and public opinion on health topics
Alessio Signorini
University of Iowa

Copyright 2014 Alessio Signorini

This dissertation is available at Iowa Research Online:

Recommended Citation
Signorini, Alessio. Use of social media to monitor and predict outbreaks and public opinion on health topics. PhD (Doctor of Philosophy) thesis, University of Iowa,

Part of the Computer Sciences Commons

USE OF SOCIAL MEDIA TO MONITOR AND PREDICT OUTBREAKS AND PUBLIC OPINION ON HEALTH TOPICS

by Alessio Signorini

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science in the Graduate College of The University of Iowa

December 2014

Thesis Supervisor: Professor Alberto Maria Segre

Copyright by ALESSIO SIGNORINI 2014
All Rights Reserved

Graduate College
The University of Iowa
Iowa City, Iowa

CERTIFICATE OF APPROVAL

PH.D. THESIS

This is to certify that the Ph.D. thesis of Alessio Signorini has been approved by the Examining Committee for the thesis requirement for the Doctor of Philosophy degree in Computer Science at the December 2014 graduation.

Thesis committee: Alberto Maria Segre (Thesis Supervisor), James Cremer, Ted Herman, Padmini Srinivasan, Philip Polgreen

ACKNOWLEDGMENTS

First and foremost I want to thank my advisor Alberto Maria Segre for helping me with my studies and my life from the moment I set foot in the United States. I really appreciate all his contributions of time, ideas, funding and support to make my Ph.D. experience productive and stimulating. He has been the best advisor I could hope for, and this thesis has been completed largely thanks to his patience, his nudges, and the freedom he allowed me to have in exploring my own research interests.
I would like to thank my father Gianluca Signorini and my grandfather Luciano Redini for having introduced me to electronics and computers. I would also like to thank my mother Antonella and my entire family for being patient with me while I was away for so long to pursue my dreams. Special thanks also go to my mentors Bruno Codenotti, Antonio Gulli, Apostolos Gerasoulis, Kimbal Musk and David Tisch for having guided me along the way and supported my ambitions and crazy projects. I would have never accomplished all I have so far without their support. Lastly, I would like to thank Philip Polgreen for having introduced me to the field of epidemiology and worked with me on this research, Sheryl Semler and Catherine Till for all the help in making sure I was always registered for class and had all my papers in order, and all the great friends I have around the world, who consciously or unconsciously had a major impact on my life.

ABSTRACT

The world in which we live has changed rapidly over the last few decades. Threats of bioterrorism, influenza pandemics, and emerging infectious diseases coupled with unprecedented population mobility led to the development of public health surveillance systems. These systems are useful in detecting and responding to infectious disease outbreaks but often operate with a considerable delay and fail to provide the necessary lead time for optimal public health response. In contrast, syndromic surveillance systems rely on clinical features (e.g., activities prompted by the onset of symptoms) that are discernible prior to diagnosis to warn of changes in disease activity. Although less precise, these systems can offer considerable lead time. Patient information may be acquired from multiple existing sources established for other purposes, including, for example, emergency department primary complaints, ambulance dispatch data, and over-the-counter medication sales.
Unfortunately, these data are often expensive, sometimes difficult to obtain and almost always hard to integrate. Fortunately, the proliferation of online social networks makes much more information about our daily habits and lifestyles freely available and easily accessible on the web. Twitter, Facebook and FourSquare are only a few examples of the many websites where people voluntarily post updates on their daily behaviors, health status, and physical location. In this thesis we develop and apply methods to collect, filter and analyze the content of social media postings in order to make predictions. As a proof of concept we used Twitter data to predict public opinion in the form of the outcome of a popular television show. We then used the same methods to monitor and track public perception of influenza during the H1N1 epidemic, and even to predict disease burden in real time, which is a measurable advance over current public health practice. Finally, we used location-specific social media data to model human travel and show how these data can improve our prediction of disease burden.
TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

1 DISEASE SURVEILLANCE
    Importance of Surveillance
    Types of Surveillance Systems
    National Electronic Disease Surveillance System
    Introduction to Social Media
    Blogs
    Wikipedia
    Twitter
    Facebook
    Flickr
    FourSquare
    Other Sources of Data: Proxy & Search Logs
    Privacy Concerns
    New Technology and Disease Surveillance
    Related Research
    Social Media for Disease Surveillance

RESEARCH APPROACH AND METHODOLOGIES
    Twitter
    Anatomy of a Tweet
    Twitter's API
    Data Gathering and Normalization
    Stemming
    Language Classification
    Spam and Unrelated Tweet Removal
    General Applicability
    Support Vector Regression

APPLICATIONS
    3.1 Predicting the American Idol 2009 Winner
    Monitoring the Swine Flu Pandemic
    Using Twitter to Estimate H1N1 Influenza Activity
    Inferring Travel from Social Media
    Predicting Local Flu Trends using Geolocated Tweets
    City Level Flu Data
    Travel Data
    Flu Correlation between Cities: distance vs. flow
    Predicting Flu Trends across Cities

CONCLUSION

APPENDIX

LIST OF KEYWORDS
    A.1 Keywords used in study of Section
    A.2 List of suffixes removed in step 5 of Porter's Algorithm

BIBLIOGRAPHY

LIST OF TABLES

Table
    1.1 Percentage of Americans performing common activities online
    Types of Tweets posted by users
    Types of Link shared on Tweets by users
    Values of commonly awarded checkins on Foursquare
    Search Keywords used as filters in Twitter's API
    MMWR Cities removed due to lack of data
    MMWR Cities removed due to overlapping metro areas
    Population vs. Twitter Penetration - Top 10 cities
    Square Correlation Coefficients for each approach
    Square Correlation Coefficients for most predictable cities
    Square Correlation Coefficients for most difficult cities

LIST OF FIGURES

Figure
    1.1 Number of Visits to 2009 Swine Flu Outbreaks page on Wikipedia
    Top categories of Foursquare checkins
    Examples of Foursquare Badges
    Example of Tweets
    Use of Hashtags in a Tweet
    Example of a Retweet
    Example of Favorites for a Tweet
    Example of Direct Reply on Twitter
    Area used as geographical filter for Twitter
    Percentage of Non-English Tweets given Twitter Profile Language
    Percentage of English vs. non-English Tweets given a US Timezone
    Percentage of English Tweets by Hour of Day
    Percentage of Spam/Non-Spam Tweets with certain features
    Tweet Volume associated to each American Idol 8 contestant
    Google Search Volume for American Idol 7 Contestants
    Google Search Volume for American Idol 8 Contestants
    Relative Number of Tweets during the Finale of American Idol
    Screenshot of H1N1 Realtime Monitor Interface
    Tweet Volume for each Category by Date
    Predicted vs. Reported ILI% in the U.S. for the 2009 Flu Season
    Predicted vs. Reported ILI% in Region 2 for the 2009 Flu Season
    Statistics on Geographical Distance between Foursquare Checkins
    Statistics on Time between Foursquare Checkins
    Travel plot with state level resolution
    Density of Foursquare checkins in New York City
    Density of Foursquare checkins in Manhattan by time of day
    User paths across New York City inferred through Foursquare checkins
    Flu & Pneumonia Deaths in New York City, NY for
    Distance vs. Correlation for Atlanta, GA
    Flow vs. Correlation for Atlanta, GA

CHAPTER 1 DISEASE SURVEILLANCE

The world in which we live has changed rapidly over the last few decades. Threats of bioterrorism, influenza pandemics, and emerging infectious diseases coupled with unprecedented population mobility led to the development of surveillance systems for public health.
According to Thacker and Berkelman [84], these systems perform an ongoing systematic collection, analysis, and interpretation of data, closely integrated with the timely dissemination of these data to those responsible for preventing and controlling disease and injury. They are generally put in place by governmental organizations (e.g., ministries of health or finance) to assess in real time the health status and behavior of certain populations, so that decision makers can lead and manage resources more effectively. Since these monitoring systems directly measure what is happening in a population, they can be used both to assess the need for an intervention and to verify its effects.

The key objective of public health surveillance is to guide interventions. The monitoring systems put in place generally aim to gather the scientific and factual data essential to make informed decisions and plan appropriate public health responses, and their design and implementation are often influenced by their objectives. Different public health objectives, and the actions necessary to reach them, may require different information systems. The type of surveillance or health information system to be used is determined by the type of action to be taken, when and how often it needs to be performed, what information is needed to take or monitor the action, and how frequently that information is needed. For example, if the goal is to prevent the spread of acute infectious diseases (e.g., SARS), the surveillance system needs to be effective in providing early warning signs so that managers can intervene quickly and stop potential epidemics. In contrast, the surveillance of chronic diseases (e.g., tuberculosis) or health-related behaviors (e.g., tobacco smoking), which change relatively slowly, can be performed simply through demographic and health surveys conducted once a year.
1.1 Importance of Surveillance

The World Health Organization (WHO) and the World Bank consider surveillance to be an essential function of a public health system [109], improving the efficiency and effectiveness of the services performed through targeted interventions and documentation of their effects on the population. Since 1975, the Centers for Disease Control and Prevention (CDC) and the WHO have collaborated with more than 30 countries to strengthen health systems and address training needs for disease detection and response in a country-specific, flexible, and sustainable manner. WHO member states need to comply with the guidelines set by the International Health Regulations and have key persons and core capacities in surveillance. In 1993 the WHO developed (in Africa) the Integrated Disease Surveillance and Response (IDSR) strategy [110], which linked epidemiological and laboratory data at all levels of the health system, putting an emphasis on integrating surveillance with response. The approach was very comprehensive and included detection, registration and confirmation of case-patients, reporting, analysis and use of data, outbreak investigations and contact tracing.

In the late 1980s, while monitoring a population of 60 million people, the Philippine Department of Health's (PDOH) integrated management information system [29] detected fewer than one outbreak per year. Nine years later the PDOH introduced the National Epidemic Sentinel Surveillance System, a hospital-based sentinel surveillance system which provided rules for both the flow of data and the personnel requirements. The pilot study was a success, and the system was integrated into the public health system and expanded to include HIV serological and behavioral risk surveillance. In 1995 alone, the system detected and investigated about 80 outbreaks.
In 2005, China launched its first Field Epidemiology Training Program (FETP) to rapidly expand its surveillance and response capacity, while Brazil and Argentina chose to use World Bank funds to improve their own systems. At the same time, with more and more data available through various channels, the U.S. Agency for International Development (USAID) redesigned its surveillance strategy to focus on the use of data to improve public health interventions [98]. Many countries adapted these new systems to their local reality: Guatemala's marriage of its FETP (part of a larger, Central American FETP) with the Data for Decision Making program [64] is one example, and India, with its decentralized system, complex cultural and population dynamics, and wide variance in the sophistication of public health institutions, provides another model for strengthening national surveillance.

As of today, more than half of the world's population lives in a country where public health surveillance is carried out by staff members and trainees of FETPs or allied programs. Programs like the Epidemic Intelligence Service in the United States, the European Program for Intervention Epidemiology Training, and Public Health Schools without Walls provide most of the surveillance and response to emerging infections in these countries, in addition to training the majority of the public health workers in the sector.

1.2 Types of Surveillance Systems

In their 1976 article [38] in the International Journal of Epidemiology, Foege and others stated that "the reason for collecting, analyzing, and disseminating information on a disease is to control that disease. Collection and analysis should not be allowed to consume resources if action does not follow." Public health surveillance systems should be implemented in such a way as to provide valid and timely information to decision makers at the lowest possible cost.
The utility of the data collected can be viewed as immediate, annual or archival, on the basis of the actions that can be taken. Similarly, the spatial resolution of the data collected (e.g., macro vs. micro areas) may be sacrificed in order to improve timeliness and save resources. For these reasons, it is not always possible or effective to deploy complex surveillance systems. In developing countries, for example, a critical challenge for the health sector is to ensure the quality and effectiveness of surveillance in decentralized environments. National-level programs and surveillance system managers may lose control of the quality and timeliness of the data collected, and donors, perceiving weakness in the national system, may create parallel nongovernmental surveillance systems to gather directly the data they need. These systems generally work in the short term but, in the long run, further weaken the public health surveillance programs already in place.

Many types of surveillance systems exist [4] and are effectively deployed around the world. Among the most commonly used are:

Vital Statistics: Keeping records of the number of births and deaths has long been used as an indicator of overall population health. The infant mortality rate (the number of deaths among infants per 1,000 births) is also used as a risk factor for a variety of adverse health outcomes. In the United States (US), vital statistics are available from the National Center for Health Statistics and from state vital records offices. The CDC also operates an online system (called CDC WONDER) containing data on births, deaths, and many diseases.

Registries: Registries are a simple type of surveillance system used for particular conditions (e.g., cancer or birth defects). They are often established at the state level to collect information about the number of people diagnosed with a certain condition and are generally used to improve prevention programs.
Population Surveys: Routine surveys are surveillance tools that are generally repeated on a regular basis [73] and can be very useful in monitoring chronic diseases and health-related behaviors. While theoretically simple to implement, surveys require a clear definition of the target population to which the results can be generalized. In addition, to avoid bias, the sample size needs to be adequate for the health condition under surveillance (i.e., rare conditions require substantial samples). Two well-known national surveys conducted in the U.S. are the Youth Risk Behavior Survey (YRBS) and the Behavioral Risk Factor Surveillance System (BRFSS). In these surveys, high school students and adults are asked about health-related behaviors such as substance use, nutrition, sexual behavior, and physical activity. The results are used to monitor trends in health behavior (e.g., YRBS showed a decline in youth smoking from 36% in 1997 to 20% in 2007), plan public health programs, and evaluate public health policies at national and state levels.

Disease Reporting: The International Health Regulations introduced by the WHO require timely reporting of certain diseases to public health officials. In addition, countries are required to report any public health emergency of international concern. In the United States, disease reporting is mandated by state law, and the list of reportable diseases varies by state. States report nationally notifiable diseases to the CDC on a voluntary basis.

Adverse Event Surveillance: The purpose of these systems is to gather information about negative effects experienced by people who have taken prescribed drugs and other therapeutic agents. Reports may come from health care providers (e.g., physicians, pharmacists, and nurses), from manufacturers, and from members of the general public, such as patients or lawyers.
Some examples of adverse event surveillance focused on patient safety are the FDA Adverse Event Reporting System (FAERS) [37] and the Vaccine Adverse Event Reporting System (VAERS, vaers.hhs.gov). The former is operated by the Food and Drug Administration (FDA), while the latter is mostly operated by the CDC. Due to their passive nature, FAERS and VAERS may suffer from underreporting or biased reporting, and while they cannot be used to determine whether a drug or vaccine caused a specific adverse health event, they are fairly useful as early warning signals.

Sentinel Surveillance: In a sentinel surveillance system, a predefined sample of reporting sources agrees to report all cases of defined conditions [73]. When properly implemented, sentinel-based systems offer an effective method of flexible monitoring with limited resources. While these systems are very effective in detecting large health problems, they may be insensitive to rare events (e.g., the emergence of a new disease). One of the best-known sentinel surveillance systems used in the United States is for influenza: selected health care providers report the number of cases of influenza-like illness to their state health department on a weekly basis, allowing macro trends to be monitored using a relatively small amount of information.

Zoonotic Disease Surveillance: Zoonotic surveillance involves systems for detecting animals infected with diseases that can be transmitted to humans. Efforts of this type were very effective [6] in 2001 during an epidemic of West Nile Virus (WNV) in Florida, and led to public health control measures such as advising the public to protect against mosquito bites and intensifying mosquito abatement efforts.

Laboratory Data: Public health laboratories that routinely conduct tests for viruses, bacteria, and other pathogens can be another useful source of surveillance data.
Laboratory serotyping provides information about cases that are likely to be linked to a common source and is useful for detecting local, state, or national outbreaks.

Syndromic Surveillance: This method of surveillance has been