Automatic extraction of show metadata
Context
Culture Crée develops tools and services that make shows easier to find and discover in the consumer marketplace. Footlight’s technology harvests unstructured or semi-structured information about shows from websites and translates it into structured, machine-readable metadata.
Culture Crée’s aspiration is to build a data graph that brings together all of Canada’s performing arts works, based exclusively on information from primary sources (i.e., producers and presenters of shows) and linked to existing data graphs already widely used on the Web. It aims to give cultural organizations the tools they need to better address discoverability issues, to pool data from the field, and to reclaim the discourse on their works.
Objectives
To reduce working time and address scalability issues, we collaborated with Culture Crée to integrate artificial intelligence into the information extraction process. The objective of the mandate was to explore, by building a prototype, the potential of a model that automatically extracts information about artists, places, and event dates from URLs.
Methodology
We carried out the entire data cycle, from defining the business angle to prototyping the tool, which included the following steps:
- Problem identification
- Data identification and diagnosis
- Data transformation (HTML files to text)
- Development, testing and selection of the best model. Two models were developed: a named entity recognition model (natural language processing) and a relevance score model (machine learning)
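To make the data transformation step concrete, here is a minimal sketch of converting an HTML page to plain text using only Python's standard library. It is an illustration under stated assumptions, not the prototype's actual code: the names `TextExtractor` and `html_to_text` are hypothetical, and the choice to skip `<script>` and `<style>` contents is an assumption about what counts as noise.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; the project's real
    transformation step is not described in the source.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []     # non-empty text fragments, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped tags and not pure whitespace.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting text would then be passed to the named entity recognition model. Putting each fragment on its own line is a design choice made here so that headings and paragraphs do not run together.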
Results
The entity recognition model performs well: it captures 70.4% of the main artist entities of an event, reducing the work time from several days for a single organization to only a few minutes, no matter how many shows or organizations are to be processed. It also identifies other, sometimes secondary, entities, which was not possible before because the traditional method simply compared words against an existing database. However, human verification of the model's output is still necessary to validate the results. The relevance score facilitates this work by letting users adjust the tool to their needs and shorten the list of entities found by the model as required.
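As an illustration of how a relevance score can shorten the list a reviewer has to check, the sketch below filters candidate entities by a user-chosen threshold. Everything in it is an assumption made for the example: the `ScoredEntity` structure, the label names, the 0-to-1 score range, and the `filter_by_relevance` function are hypothetical and do not describe the prototype's actual interface or scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredEntity:
    text: str         # entity as found in the page text
    label: str        # hypothetical label set, e.g. "ARTIST", "PLACE", "DATE"
    relevance: float  # assumed to lie in [0, 1]; higher means more relevant


def filter_by_relevance(entities, threshold):
    """Keep entities scoring at or above the threshold, most relevant first."""
    kept = [e for e in entities if e.relevance >= threshold]
    return sorted(kept, key=lambda e: e.relevance, reverse=True)


# Usage with made-up candidates and scores.
candidates = [
    ScoredEntity("Example Hall", "PLACE", 0.61),
    ScoredEntity("Jane Doe", "ARTIST", 0.92),
    ScoredEntity("Subscribe", "ARTIST", 0.08),
]
for entity in filter_by_relevance(candidates, threshold=0.5):
    print(entity.label, entity.text, entity.relevance)
```

Raising the threshold gives the reviewer a shorter list with fewer false positives, at the risk of dropping genuine secondary entities; lowering it does the opposite.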