Multilingual multi-document continuously-updated social networks

Bruno Pouliquen, Ralf Steinberger & Jenya Belyaeva
European Commission Joint Research Centre
Via Enrico Fermi 1, Ispra (VA), Italy
{Bruno.Pouliquen,

Abstract

We present a fully automatic live online system that produces monolingual or mixed-language social network graphs showing which groups of persons are being mentioned together in the world news of the last few hours. The system is based on name mentions extracted automatically from an average of 35,000 news articles per day in 32 languages. For any given person on the graph, hyperlinks lead to the list of text snippets and to the original texts where the person was mentioned, as well as to a dedicated webpage containing additional information about this person gathered over several years. For any link between persons, hyperlinks lead to the list of text snippets and to the full texts where both persons are mentioned. Building multilingual social networks that even cross writing systems (Arabic, Greek, Chinese, etc.) is made possible by exploiting the name database built up by the multilingual online NewsExplorer system (Steinberger et al. 2005), which automatically associates name variants with the same person identifier. We also discuss differences between live social networks generated from the news in different languages for the same time period.

Keywords

Social Networks, multilinguality, multi-document summarisation, Named Entity Recognition, name variant merging, visualisation.

1. Introduction

To a large extent, the factual part of news is about themes or events (taking place at certain locations at a certain time) and about persons or organisations. The news analysis system NewsExplorer (Steinberger et al. 2005) tries to give views of the news along the axes events (news clusters), locations, named entities (mainly persons and organisations) and time (via time lines, i.e. historical linking of news). In addition to linking news via these entities and axes, news items in NewsExplorer are also linked across languages.

In this paper, we present an additional way of giving users access to news: live social networks, i.e. graphs displaying groups of persons that are frequently mentioned together in the news of the last few hours and up to one day. Probably the most interesting aspect of the presented approach is the high multilinguality of the system (32 languages) and the fact that names are linked across languages (and writing systems) even if they are spelt differently or have been inflected.

Users can view the multi-document, multilingual and cross-language live system on its website. In addition to the most recent multilingual social networks displayed at that site, it is also possible to produce social networks separately by language or by the country of origin of the news, as well as for documents covering a specific theme. These customised social networks are not accessible to the public, but in this paper we compare the multilingual networks with monolingual networks in four languages (Section 5).

This social network generation tool takes as input the Europe Media Monitor (EMM; Best et al. 2005) news data and furthermore makes use of the following technology: (a) multilingual name recognition software, (b) approximate name matching software that identifies name variants for the same person, (c) multilingual language-dependent morphological name inflection generation software, and (d) network generation and visualisation software. Tools (a), (b) and (c) are part of the NewsExplorer system, which analyses news every day, links news over time (topic detection and tracking) and across languages (cross-lingual topic tracking), extracts new and known names, collects information about people and visualises the results in various ways.
The 12 co-occurrence graphs visible at the abovementioned site are updated every two hours. Graph production starts completely anew every 24 hours at midnight so that users always see the social network graphs of today's world-wide news. Information found in the news of all 32 languages is fully aggregated and all the results are visualised together.

Section 2 points to work with a similar focus. Section 3 summarises the text analysis technology underlying the social network generation. Section 4 focuses on the network generation, size reduction and visualisation. In Section 5, we discuss the network generation results, comparing the mixed-language network with various monolingual networks for a sample 8-hour snapshot for Friday 13 July. Section 6 concludes the paper and points to future work.

2. Related Work

Due to the large volume of various types of information on the internet, there are now various applications that try to produce person profiles and to exploit similarities for various purposes (e.g. to provide focused advertising, to provide meeting forums, etc.). Some social network services like LinkedIn (LinkedIn 2007) or MySpace (MySpace 2007) build and verify online social networks, connecting registered users by different types of interests (company, country, research interests, etc.). The features used for the linking are typically user-provided. To our knowledge, the only tools that extract the underlying linking features fully automatically are Connivence Maps by Pertinence Mining (Connivence 2007; based on English and French news) and Silobreaker, based on Elucidon software (Silobreaker 2007, English only), but the producers do not say how their technology works and it is not even clear whether the networks are manually edited.
For related work on individual components of the presented system (Named Entity Recognition, name variant matching, dealing with highly inflected languages, etc.), see Steinberger & Pouliquen (2007).

3. The underlying news data and text analysis technology

The social networks under discussion are extracted from live news, using resources on person names and their spelling variants. In this section, we briefly summarise where the news data comes from (Section 3.1), how person names have been extracted across many languages and over years to build a name database of currently 615,000 names (3.2), how spelling variants for the same name have been gathered and merged automatically (3.3) and how morphological inflections of known names are recognised in Balto-Slavonic and other highly inflected languages (3.4). Section 4 then explains how this data is used to produce live social networks.

3.1 Gathering the news data

The JRC's Europe Media Monitor system (Best et al. 2005) gathers an average of 35,000 news articles per day in 32 languages by continuously monitoring about 1,100 public news sites from around the world for newly published information. All new articles are downloaded, converted to the standard UTF-8-encoded XML news format RSS, full-text indexed and classified according to themes and the countries mentioned in the text. The result is published on the EMM-NewsBrief site (http://press.jrc.it), which is updated every ten minutes.

3.2 Multilingual Named Entity Recognition

For 19 of the 32 languages, the related EMM-NewsExplorer application (http://press.jrc.it/newsexplorer, Steinberger et al. 2005) clusters all articles gathered during the previous day by similarity in order to group all articles about the same subject or event.
For all clusters, references to geographical places, persons and organisations are identified, using finite state automata to recognise known names and regular expressions to recognise new names or name variants (new names are recognised in 14 languages only: Da, De, En, Es, Et, Fr, It, Nl, No, Pt, Ro, Sl, Sv, Tr). Sequences of uppercase words are identified as a name if they contain known first names or if they are surrounded by empirically collected lexical patterns consisting of titles (e.g. Minister), words indicating nationality (e.g. German), age (e.g. 32-year-old), occupation (e.g. playboy), a significant verbal phrase (e.g. has declared), and more. We refer to these patterns generically as trigger words. Name stop words are used to avoid identifying frequent uppercase words (e.g. Monday) as part of a name. For a detailed description of this process, see Steinberger & Pouliquen (2007). The process does not make use of part-of-speech or other linguistic information, which keeps it simple and makes it easy to extend to many languages.

3.3 Name variant matching and merging

For all unknown names found during the daily analysis, an approximate string matching algorithm checks whether the name is likely to be a variant of a known name or whether it is a new name. New names are added to the database with a new identifier. Names found in at least five different news clusters are added to the list of known names.

Periodically, a search on Wikipedia (Wikipedia 2007) is carried out to gather the name translations that can be found there, as well as photographs. Wikipedia is especially useful for finding name transliterations in languages using different scripts, such as Asian languages or languages using the Cyrillic, Arabic or Hellenic scripts.
The approximate string matching algorithm that compares newly found names with the 615,000 known names and their 143,000 known variants (status July 2007) is a multi-step process, details of which are described in Steinberger & Pouliquen (2007). To avoid a performance bottleneck when comparing each of several hundred new names per day with close to a million known names and name variants, we first apply a name normalisation step. Only if the normalised new name is identical with a normalised name (or any of its variants) in the database do we apply the edit distance approximate matching algorithm (Zobel & Dart 1995) to two different name representations: once to the normalised name form and once to the normalised name form with the vowels removed. If the average of the two similarity values for the new and the known name is above an empirically set threshold, the two names are classified as variants of each other. Otherwise, the new name is added to the database as a new name.

The name normalisation rules eliminate diacritics, reduce two neighbouring identical consonants to a single consonant, unify frequent spelling variants across languages, etc. For instance, the German name-initial 'Wl' and the name-final 'ow' for Russian names (as in Wladimir Ustinow) are replaced by 'Vl' and 'ov'; the Slovene 'š' and the German 'sch' are replaced by 'sh'; French 'ou' (as in Oustinov) is replaced by 'u', etc. These normalisation rules are exclusively driven by pragmatic needs and have no claim to represent any underlying linguistic concept.

An average of 400 new person names are automatically recognised as part of the NewsExplorer text analysis every day. The NewsExplorer database keeps track of all name mentions plus the list of trigger words (the titles and phrases) they are associated with.
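The normalisation rules and the two-representation comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the NewsExplorer implementation: the rule set covers only the examples given in the text, the conversion of edit distance into a similarity score (one minus distance divided by the longer length), the treatment of 'y' as a vowel, the threshold value 0.8 and all function names are hypothetical, and the candidate pre-selection step is not shown.

```python
import re
import unicodedata

VOWELS = set("aeiouy")  # assumption: 'y' is treated as a vowel


def normalise(name: str) -> str:
    """Apply a small illustrative subset of the normalisation rules."""
    s = unicodedata.normalize("NFC", name).lower()
    # Unify frequent spelling variants (only the examples from the text).
    s = s.replace("sch", "sh").replace("š", "sh").replace("ou", "u")
    # Eliminate remaining diacritics.
    s = "".join(c for c in unicodedata.normalize("NFKD", s)
                if not unicodedata.combining(c))
    words = []
    for w in s.split():
        if w.startswith("wl"):          # name-initial 'Wl' -> 'Vl'
            w = "vl" + w[2:]
        if w.endswith("ow"):            # name-final 'ow' -> 'ov'
            w = w[:-2] + "ov"
        # Reduce two neighbouring identical consonants to one.
        w = re.sub(r"([b-df-hj-np-tv-xz])\1", r"\1", w)
        words.append(w)
    return " ".join(words)


def strip_vowels(s: str) -> str:
    return "".join(c for c in s if c not in VOWELS)


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def averaged_similarity(new: str, known: str) -> float:
    """Average the similarity of the normalised forms and of the
    normalised forms with vowels removed."""
    n, k = normalise(new), normalise(known)
    return (similarity(n, k)
            + similarity(strip_vowels(n), strip_vowels(k))) / 2


def is_variant(new: str, known: str, threshold: float = 0.8) -> bool:
    return averaged_similarity(new, known) > threshold


print(normalise("Wladimir Ustinow"), "|", normalise("Vladimir Oustinov"))
print(is_variant("Andrei Lugovoi", "Andrei Lugovoy"))
print(is_variant("Andrei Lugovoi", "Alexander Litvinenko"))
```

With these assumed rules, the German and French spellings of Ustinov collapse to the same normalised string, so they would already match before any edit distance is computed.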
Lang  Newspaper         Snippet
sl    vecer             glavnega osumljenca za umor Aleksandra Litvinenka v Londonu postavili pred
sl    vecer             v ponedeljek zavrnil izrocitev Andreja Lugovoja, da bi ga kot glavnega
tr    sabah             öldürülen eski KGB ajani Alexander Litvinenko'nun davasi, Ingiltere-Rusya
tr    sabah             cinayetin zanlisi olarak istedigi Andrei Lugovoy'u Rusya'nin iade etmemesi
en    dailytimespk      suspected of killing Kremlin critic Alexander Litvinenko in London last year,
en    dailytimespk      when British prosecutors alleged that Andrei Lugovoi used a rare radioactive
pt    DiariodeNoticias  assassínio do ex-oficial do KGB Alexander Litvinenko. A revelação foi feita
pt    DiariodeNoticias  acederia ao pedido de extradição de Andrei Lugovoi (outro ex-agente do KGB)
en    taipeitimes       Kremlin following its refusal to extradite Andrei Lugovoi, the former KGB...
en    taipeitimes       KGB agent suspected of murdering Alexander Litvinenko last November.
en    eirepost          Lugovoi over the murder of Alexander Litvinenko, describing the decision
en    eirepost          Russia's refusal to extradite Andrei Lugovoi over the murder of Alexander
sl    delo              in nekdanjega tajnega agenta KGB Andreja Lugovoja. London - Britanija in
sl    delo              in ostrega Putinovega kritika Aleksandra Litvinenka, ki je bil nekoc prav tako
en    rian              - Russia considers the Alexander Litvinenko case a purely criminal matter,
en    rian              Moscow has refused to extradite Andrei Lugovoi, a former Kremlin bodyguard,

Table 1. Text snippets in newspapers of various languages showing both the names Alexander Litvinenko and Andrei Lugovoi.

3.4 Dealing with morphological inflection

The current list of known names, i.e. the names that were found in at least five independent news clusters, consists of approximately 50,000 names plus 135,000 variants. These names can be identified in text of any language through a simple lookup procedure, i.e. no lexical patterns are required.
This works well for languages with little morphological proper noun variation (e.g. most Western European languages, Arabic, Bulgarian, etc.). However, for Balto-Slavonic, Finno-Ugric and other languages, looking up the base form of a name yields poor results because the names are not found when they are inflected. For instance, Estonian Bushiga and Slovene Bushom are both inflections of Bush. Table 1 shows some morphological variants of the names Alexander Litvinenko (e.g. Litvinenka, Litvinenko'nun) and Andrei Lugovoi (Lugovoja, Lugovoy'u).

As acquiring or developing morphological resources for all 32 EMM languages is out of our reach, we use relatively simple, hand-crafted paradigm expansion rules that generate a number of morphological variants for each of the known names and their variants. These rules, described in more detail in Pouliquen et al. (2005), either add various name endings to the same name or substitute endings to generate a set of new endings. For the name of the Secretary-General of the Council of the European Union, Javier Solana, for instance, we generate various inflection forms so that the strings Javierja Solane (sl), Javierom Solanom (sk), Javierem Solaną (pl), Javierjem Solano (sl) and Javiera Solany (pl) will all be found and identified as variants of Javier Solana. These morphological paradigm extension heuristics do not solve all problems, but the most frequent morphological variants can normally be captured, and overgenerated (wrong) variants are not harmful as they will simply not be found.

For the lookup procedure, we use FLEX (Paxson 1995) to produce a finite state automaton. This tool is useful for the efficient lookup of large name lists including character-level regular expressions for suffixes, etc. It also allows looking up person names in languages that do not use white space to separate words, such as Chinese.

4. Social network generation and visualisation

The input for the work on social network generation consists of a stream of all incoming EMM news articles (32 languages), in which references to known persons have been marked up using the finite state automaton described in Section 3.4. Names not previously known are not recognised in the live system, but the list of known names is updated every day. For efficiency purposes, we build a constantly updated index that records, for each recognised name and for each pair of names, all 300-character text snippets around the names. Table 1 shows multilingual text snippets for the names Alexander Litvinenko and Andrei Lugovoi.

The index is reset at midnight every day so that it always contains the latest news and name mentions. We plan to turn this index into a 24-hour rolling window from which articles older than 24 hours are deleted. This will give more consistency to the networks shown and will be more useful for users living in different time zones. The index is read every two hours to update the graphs.

4.1 Turning links into a network

When two names are mentioned in the same article, a link is created between these two names. The more frequently the two names are mentioned together, the stronger the link. In the input example shown in Table 1, the tool will thus build a link of weight (strength) 8 between the two persons Alexander Litvinenko and Andrei Lugovoi because they are mentioned together in eight different documents. We also have the possibility to filter graphs by language, country or subject area. This can be useful for users wanting to analyse news articles for a very specific domain. When considering only English documents, the weight of the link between Litvinenko and Lugovoi in Table 1 would be 4.

Figure 1. Example of the result of graph filtering: we retain only persons having many links to others.

Any set or subset of links can be used to build a graph with edges (links) and vertices (persons, nodes). When considering the co-occurrence relationship under discussion, links are obviously non-directional, but in other types of relationships, it may be necessary to show the direction of relations. In the criticism and support relationship extracted by Tanev (2007), for instance, links need to be directed because such relationships are not necessarily bidirectional, and in the quotation relationship (person A refers to person B in direct speech; Pouliquen et al. 2007), users may also want to see who makes reference to whom. These more specific types of relationships and the resulting social networks are not yet incorporated in the public version of NewsExplorer.

Each sub-graph contains connected persons, but these persons may be connected indirectly, i.e. if person A is linked to person B and person B to person C, A and C will occur in the same sub-graph. The presence or absence of drawn edges indicates whether the link is direct or indirect.

4.2 Reducing the size of a network

The network of links can grow rather big in any given 24-hour period. For all languages together, we find approximately 50,000 links per day. In order to reduce the amount of information to the stronger links, we set a threshold on the number of co-occurrences (the link strength) required for each name pair. Depending on the number of articles, we set this threshold to between 1 (for instance for graphs fed by single languages with fewer articles) and 4 (for the multilingual graph fed by all 32 EMM languages).

We use a simple algorithm to divide the graphs into sets of sub-graphs, which may be connected or not. Using the previously mentioned threshold, the network of all links can be cut down into sub-graphs, i.e. graphs are automatically separated if they either have no links or if the link strength is lower than the threshold.
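The link counting of Section 4.1 and the threshold-based separation into sub-graphs can be sketched as follows. This is a minimal illustration, not the system's implementation: the input format (one set of person identifiers per article), the function names, the choice to keep links whose weight equals the threshold, and the sample data are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def link_weights(articles):
    """Count, for each unordered pair of persons, the number of
    articles in which both are mentioned."""
    weights = Counter()
    for persons in articles:
        for a, b in combinations(sorted(set(persons)), 2):
            weights[(a, b)] += 1
    return weights


def sub_graphs(weights, threshold):
    """Drop links weaker than `threshold` and return the connected
    components of what remains, biggest first."""
    adjacency = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= threshold:   # assumption: equal weight is kept
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:              # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical sample: persons A..E, one set per article.
articles = ([{"A", "B"}] * 4 + [{"B", "C"}] * 4
            + [{"C", "D"}] + [{"D", "E"}] * 4)
weights = link_weights(articles)
print(weights[("A", "B")])        # weight of the A-B link
print(sub_graphs(weights, 4))     # weak C-D link is cut
print(sub_graphs(weights, 1))     # everything stays connected
```

In the sample, A and C never co-occur but end up in the same sub-graph through B, which is the indirect connection described in Section 4.1. Raising the threshold from 1 to 4 removes the single C-D co-occurrence and splits the network in two.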
However, if there are links above the threshold, both graphs will be joined into one. Towards the end of a 24-hour period, the graphs are often illegibly large (see, for instance, the first graph in Figure 1). This might be an indication that 12-hour windows would be more appropriate.

For practical reasons, we display only the 12 biggest sub-graphs. For visualisation purposes, we further reduce the size of a graph if its total number of persons (nodes, or vertices) is larger than 120. We do this by deleting those persons having only one link, looping until no more vertices can be deleted. If the remaining number of persons is above 140, we remove the persons having no more than two edges. If the remaining number is still more than 160, we delete the ones having no more than three edges, and so on, according to the following algorithm:

    threshold = 1; min = 100;
    While (numberofvertices > min + 20 * threshold) {
        Do {
            // delete vertices having at most `threshold` links
            DeleteVertexHavingLessThan(threshold + 1);
        } until (novertexdeleted);
        threshold += 1;
    };

4.3 Visualising the network

The