Efficiently identifying disguised nulls in heterogeneous text data - Laboratoire d'informatique de l'X (LIX)
Conference paper. Year: 2021

Efficiently identifying disguised nulls in heterogeneous text data

Théo Bouganim
Ioana Manolescu
Helena Galhardas
  • Role: Author
  • PersonId: 1109600

Abstract

Digital data is produced in many data models, ranging from highly structured (typically relational) to semi-structured models (XML, JSON) to various graph formats (RDF, property graphs) and text. Most real-world datasets contain a certain amount of null values, denoting missing, unknown or inapplicable information. While some data models allow representing nulls by special tokens, so-called disguised nulls are also frequently encountered: these are values that are not, syntactically speaking, nulls, but which nevertheless denote the absence, unavailability or inapplicability of the information. This paper describes our ongoing work toward detecting disguised nulls in textual data encountered in ConnectionLens graphs. Driven by journalistic applications, we focus for now on large, semi-structured datasets, where most or all data values are free-form text. We show that the state-of-the-art methods for detecting nulls in relational databases, mostly tailored toward numerical data, do not detect disguised nulls efficiently on such data. Then, we present two alternative methods: (i) one leveraging Information Extraction, and (ii) one based on text embeddings and classification. We detail their performance-precision trade-offs on real-world datasets.
Main file
Efficiently_identifying_disguised_nulls_in_heterogeneous_text_data.pdf (1.2 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03347947, version 1 (17-09-2021)

Identifiers

  • HAL Id: hal-03347947, version 1

Cite

Théo Bouganim, Ioana Manolescu, Helena Galhardas. Efficiently identifying disguised nulls in heterogeneous text data. BDA (Conférence sur la Gestion de Données – Principles, Technologies et Applications), Oct 2021, Paris, France. ⟨hal-03347947⟩