REVIEW article

Front. Artif. Intell., 23 June 2023
Sec. Machine Learning and Artificial Intelligence

A review of data abstraction

  • Dipartimento di Ingegneria Informatica, Automatica e Gestionale “A. Ruberti”, Sapienza University of Rome, Rome, Italy

It is well known that Artificial Intelligence (AI), and in particular Machine Learning (ML), is not effective without good data preparation, as also pointed out by the recent wave of data-centric AI. Data preparation is the process of gathering, transforming and cleaning raw data prior to processing and analysis. Since data nowadays often reside in distributed and heterogeneous data sources, the first activity of data preparation requires collecting data from suitable data sources and data services. It is thus essential that providers describe their data services in a way that makes them compliant with the FAIR guiding principles, i.e., automatically Findable, Accessible, Interoperable, and Reusable (FAIR). The notion of data abstraction has been introduced exactly to meet this need. Abstraction is a kind of reverse engineering task that automatically provides a semantic characterization of a data service made available by a provider. The goal of this paper is to review the results obtained so far in data abstraction, by presenting the formal framework for its definition, reporting on the decidability and complexity of the main theoretical problems concerning abstraction, and discussing open issues and interesting directions for future research.

1. Introduction

Despite the increasing centrality of data in AI, the way in which AI deals with data has remained virtually unchanged since the dawn of the discipline. This has to be contrasted with the well-known fact that Artificial Intelligence (AI), and in particular Machine Learning (ML), is not effective without good data preparation, as also pointed out by the recent wave of data-centric AI. The term “data centric” refers to an architecture where data is the primary and permanent asset. So, data preparation precedes the implementation of any given machine learning task, and can potentially support many such tasks relying on the same domain. More specifically, data preparation is the process of gathering, transforming and cleaning raw data prior to processing and analysis. It is therefore regarded as an important step in any data engineering or data science project, including machine learning, and involves tasks such as understanding, collecting and reformatting data, aggregating, integrating, combining and enriching raw source data, and making modifications and corrections in order to meet quality standards.

The first activity of data preparation requires collecting data from suitable data sources and data services, often distributed and heterogeneous. In the era of data as a driving asset for both the private and the public domain, the availability of services providing data, also called data services, is indeed growing incredibly fast. Thus, on the one hand, more and more data services are available; on the other hand, more and more AI tasks and applications rely on data services. This scenario opens two crucial issues for data-centric AI. First, from a consumer point of view, how to find the “right” data, i.e., data which properly respond to an information need? Second, from a provider point of view, how to release FAIR-compliant data services, i.e., services automatically Findable, Accessible, Interoperable, and Reusable (FAIR)? An effective answer to the former question is given by exploiting the state-of-the-art technology for answering queries over data integration systems, which stems from more than thirty years of research. As for the second question, an answer is given by the results on a relatively new service of data integration systems, called abstraction. In order to elaborate more on both these answers, let us first take a step back to data integration.

Data integration is the problem of providing a unified and reconciled view of the data stored in a set of autonomous and heterogeneous sources. The theoretical works on data integration systems have advocated a three-layer architecture comprising the data sources, which in our setting are the output of the data services, the global schema, which is a unified shared conceptualization of the domain of interest, and the mapping between the sources and the global schema. Formally, a data integration system is a triple J = ⟨G, S, M⟩, where G is the global schema, S is the source schema and M is the mapping, i.e., a set of logical assertions describing how the data at the sources relate to the elements of the global schema. Then, intuitively, given a set of data sources D, J represents all the (possibly incomplete) databases that are instances of G satisfying M w.r.t. D.

Once data services have been integrated by means of a data integration system specified through a triple J = ⟨G, S, M⟩, a data service consumer can rely on query answering in order to find the “right data.” Specifically, by unambiguously expressing an information need as a query qG over the shared vocabulary of G, the consumer can get the answers that “best correspond” to that need without even having to know the relevant data services. In particular, in most approaches, such answers have been identified as certain answers, i.e., answers to qG that would be returned by every database represented by J given a set of data sources D. Also, typically, such answers are computed by first reformulating qG in terms of a query qS over the source schema and then evaluating qS over D. Conversely, in order to make a data service FAIR-compliant, a provider can rely on abstraction over J. Specifically, given a data service originally expressed as a query over a set of data sources, the provider can get a query over the shared vocabulary of G that unambiguously describes the data service content, thus making it accessible, interoperable and reusable. Concretely, given a query qS over the data sources, the provider would get a query qG over the global schema whose answers “best correspond” to the data service. Obviously, also for abstraction, the meaning of “best correspond” has to be made precise. Ideally, the query qG is the one whose certain answers are exactly the answers of qS, for every possible source database. Such a query qG is called the perfect J-abstraction of qS.

We next use an example for informally introducing and illustrating the main notions related to abstraction. In the example, we focus on queries that are conjunctions of atoms, called conjunctive queries (CQ), and unions thereof, called unions of conjunctive queries (UCQ), and we assume that the evaluation of a query expressed over the global schema is based on the certain answer semantics.

Example 1. Let J = ⟨G, S, M⟩ be a data integration system where the elements of the source schema S are the predicates (with associated arity) {s1/1, s2/2, s3/1, s4/1, s5/2}, the elements of the global schema G are {g1/2, g2/1, g3/2, g4/2, g5/1}, and M contains the following assertions (where the free variables are implicitly universally quantified):

m1: s1(x) → ∃y.g1(x, y)
m2: s2(x, y) → g1(x, y)
m3: s3(x) ∧ s4(x) → g2(x)
m4: s3(x) ∧ s2(x, y) → g3(x, y)
m5: ∃z.s5(y, z) ∧ s2(x, y) → g3(x, y)
m6: s5(x, y) → g4(x, y)
m7: s1(x) ∧ s4(x) → g5(x)

Consider the query qS1 = {(x, y) ∣ s2(x, y)}. It is easy to see that, for every source database D, the set of certain answers of qG1 = {(x, y) ∣ g1(x, y)} coincides with the set of answers of qS1 w.r.t. D. It follows that the CQ qG1 is a perfect J-abstraction of qS1.

Consider the query qS2 = {(x) ∣ ∃y.s2(x, y)}. A natural candidate for the perfect J-abstraction of qS2 is qG2 = {(x) ∣ ∃y.g1(x, y)}. Note, however, that, because of m1, the certain answers to qG2 include tuples in s1 that may not belong to s2, and therefore qG2 is not even a sound J-abstraction of qS2 (i.e., it does not retrieve only tuples of qS2). Indeed, it can be shown that no UCQ exists that is a perfect J-abstraction of qS2. However, the query asking for those x such that g1(x, y) is known to be true for some y, i.e., holds in every model of J, cannot exploit mapping m1, and therefore avoids retrieving tuples from s1. It follows that such a query, which is not expressible as a UCQ, is a perfect J-abstraction of qS2. Consider the query qS3 = {(x) ∣ s1(x)}. Again, the natural candidate for the perfect J-abstraction of qS3 is qG2 = {(x) ∣ ∃y.g1(x, y)}. However, because of m2, the certain answers to qG2 also include the values in the first component of s2, and this means that qG2 is not a sound J-abstraction of qS3, although it is a complete one (i.e., it retrieves all tuples of qS3). Another possible candidate is the query qG3 = {(x) ∣ g5(x)}. However, this query captures only the tuples occurring in s1 which also occur in s4. It follows that qG3 is a sound J-abstraction, although not a complete one. Actually, it can be shown that no perfect J-abstraction of qS3 exists in the class UCQ, but qG2 and qG3 are, respectively, the minimally complete and the maximally sound J-abstraction of qS3 in the class UCQ.
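The comparisons made so far in Example 1 can be replayed on a small source database. The following is a minimal sketch, not an implementation taken from the reviewed works: it hard-codes m1 to m7, represents the unknown value required by m1 as a labeled null, and relies on the standard fact that, when the global schema has no axioms, the certain answers of a UCQ are obtained by evaluating it over the resulting canonical global database and discarding tuples that contain nulls. The names `Null`, `canonical_db` and `drop_nulls`, as well as the sample database `D`, are hypothetical.

```python
import itertools


class Null:
    """Labeled null standing for an unknown value introduced by an existential."""
    _counter = itertools.count()

    def __init__(self):
        self.label = next(Null._counter)

    def __repr__(self):
        return f"_n{self.label}"


def canonical_db(src):
    """Apply m1..m7 of Example 1 to a source database (dict: name -> set of tuples)."""
    s1, s2, s3, s4, s5 = (src.get(p, set()) for p in ("s1", "s2", "s3", "s4", "s5"))
    has_s5 = {y for (y, _z) in s5}  # values y such that s5(y, z) holds for some z
    return {
        "g1": {(x, Null()) for (x,) in s1} | set(s2),                  # m1, m2
        "g2": s3 & s4,                                                 # m3
        "g3": {(x, y) for (x, y) in s2 if (x,) in s3 or y in has_s5},  # m4, m5
        "g4": set(s5),                                                 # m6
        "g5": s1 & s4,                                                 # m7
    }


def drop_nulls(tuples):
    """Keep only the tuples made of constants (the certain ones)."""
    return {t for t in tuples if not any(isinstance(v, Null) for v in t)}


# Hypothetical source database: a occurs in s1, (b, c) occurs in s2.
D = {"s1": {("a",)}, "s2": {("b", "c")}}
G = canonical_db(D)

q_s1 = D["s2"]                                     # qS1 = {(x, y) | s2(x, y)}
q_g1 = drop_nulls(G["g1"])                         # certain answers of qG1
q_s2 = {(x,) for (x, _y) in D["s2"]}               # qS2 = {(x) | ∃y. s2(x, y)}
q_g2 = drop_nulls({(x,) for (x, _y) in G["g1"]})   # certain answers of qG2
q_g6 = {(x,) for (x, _y) in drop_nulls(G["g1"])}   # x with a *known* y in g1
```

On this database `q_g1` coincides with `q_s1`, whereas `q_g2` also returns `a` and is therefore not sound for `q_s2`; restricting to known pairs (`q_g6`) recovers exactly `q_s2`.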

Consider now the query qS4 = {() ∣ ∃x, y.s5(x, y) ∧ s3(x)}, and assume that we aim at checking whether its perfect J-abstraction can be expressed as a UCQ. We immediately observe that {() ∣ ∃x, y.g4(x, y) ∧ g2(x)} is a sound J-abstraction of qS4. Also, we can easily verify that {() ∣ ∃x, y, x1.g4(x, y) ∧ g3(x, x1) ∧ g2(x1)} is also sound, and may retrieve tuples that are not retrieved by {() ∣ ∃x, y.g4(x, y) ∧ g2(x)}. More generally, all queries of the form {() ∣ ∃x, y, x1, …, xn.g4(x, y) ∧ g3(x, x1) ∧ … ∧ g3(xn−1, xn) ∧ g2(xn)}, for n ≥ 1, are pairwise incomparable sound J-abstractions of qS4. Based on this observation, one can show that there exists no maximally sound J-abstraction of qS4 in the class UCQ. However, the following Datalog query (with goal Ans) is the maximally sound J-abstraction of qS4 in the whole class of monotone queries:

g3(x, y) → t1(x, y)
t1(x, y) ∧ t1(y, z) → t1(x, z)
g4(x, y) ∧ g2(x) → Ans()
g4(x, y) ∧ t1(x, z) ∧ g2(z) → Ans()
△
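To see how the Datalog query above subsumes every query in the infinite family of sound UCQ abstractions, it can help to evaluate it bottom-up. The sketch below is only illustrative: it evaluates the four rules over a plain set of global facts (in the integration setting one would evaluate it over the canonical global database, which for Example 1 contains no nulls in g2, g3 and g4), and the function names and sample facts are assumptions made for the example.

```python
def transitive_closure(edges):
    """Least fixpoint of: g3(x, y) → t1(x, y) and t1(x, y) ∧ t1(y, z) → t1(x, z)."""
    closure = set(edges)
    while True:
        derived = {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        if derived <= closure:
            return closure
        closure |= derived


def ans(g2, g3, g4):
    """Truth value of the goal Ans() for the given extensions of g2, g3 and g4."""
    t1 = transitive_closure(g3)
    starts = {x for (x, _y) in g4}
    if any((x,) in g2 for x in starts):                        # g4(x, y) ∧ g2(x) → Ans()
        return True
    return any(x in starts and (z,) in g2 for (x, z) in t1)    # g4 ∧ t1 ∧ g2 → Ans()


# Hypothetical global facts: a g3-chain of length 2 from a to d.
g4_facts = {("a", "b")}
g3_facts = {("a", "c"), ("c", "d")}
reached = ans({("d",)}, g3_facts, g4_facts)       # d is reachable from a
not_reached = ans({("e",)}, g3_facts, g4_facts)   # e is not reachable from a
```

Here `reached` corresponds to the member of the UCQ family with n = 2, which the single recursive rule for t1 captures together with every other chain length.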

We point out that, apart from the scenario of data services providers, data abstraction is relevant in several other contexts. We mention three of them here. In the context of ontology-based data management, abstraction can be used to check whether the mapping provides the right coverage for expressing the relevant data services at the global schema level (Lutz et al., 2018). Also, abstractions can provide the semantics of open datasets and open APIs published by organizations, which is a key aspect for unlocking the full potential of open data (Cima et al., 2017). Finally, abstraction can be the basis for a semantic-based approach to source profiling (Abedjan et al., 2017), again one of the tasks of data preparation, in particular for describing the structure and the content of a data source in terms of the business vocabulary.

The goal of this paper is to review the main notions and results about abstraction. We present the formal framework for its definition, and report on the decidability and complexity of the main theoretical problems concerning abstraction, i.e., verification, existence, and computation. The roadmap of the paper is as follows:

• Section 2 introduces some relevant background about databases, queries, and data integration.

• Section 3 illustrates the formal framework for abstraction in data integration by providing some of the key definitions used throughout the paper.

• Section 4 reports results appearing in Cima et al. (2021) on the relationship between abstraction and another well-studied problem, namely view-based query processing (see, e.g., Halevy, 2001). The latter is the problem of answering a query over a schema S in terms of a set of materialized views over S. Interestingly, the established relationship between abstraction and view-based query processing brings to light new results about both problems.

• Section 5 illustrates results related to the problem of computing best UCQ abstractions of UCQ source queries (Cima et al., 2019). The main results are that, while minimally complete abstractions are guaranteed to exist, this is not the case for maximally sound abstractions. Motivated by the latter result, a restricted scenario is introduced, in which the existence of maximally sound abstractions is always guaranteed.

• Section 6 surveys results on computing best monotone abstractions of UCQ source queries (Cima et al., 2022). The principal contributions are the definition of a novel monotone query language (in the context of data integration) and the discussion of how such a language is able to express all forms of the best monotone abstractions (perfect, or approximated).

• Section 7 presents results on computing abstractions of UCQ source queries in a specific, well-known non-monotone query language (Cima et al., 2020). The main result is that no form of best abstraction is guaranteed to exist in such a language; in light of this result, two interesting restricted scenarios are investigated.

• Finally, Section 8 concludes the paper by discussing possible future research on abstraction.

2. Preliminaries

2.1. Databases and queries

We assume a denumerable set of constant symbols C that is included in every alphabet that we shall consider. A database schema (or simply schema) T is a logical theory, i.e., a finite set of logical axioms, over an alphabet AT of predicate symbols and constants from C. A T-database D is simply a model of T, i.e., an interpretation for AT that satisfies all the axioms of T, with the additional requirements that (i) the domain of D is C, (ii) every constant is interpreted as itself, and (iii) the extension of every predicate is finite.1 In what follows, we will often see a T-database as a finite set of ground facts over AT, each corresponding to a tuple in the extension of the associated predicate.

As customary, a database query over a schema T of arity n, or simply an n-ary T-query, is a function associating to each T-database a finite set of tuples of constants of arity n. Often, however, it is more convenient to specify queries using expressions from some formal language to which a semantics, i.e., an actual query function, is associated. In what follows, whenever we talk about a query language L, we mean the class of all queries that can be expressed using L and its associated semantics.

A fundamental query language for our work is the language of First-Order Logic (FOL) queries. A FOL query q for a schema T is a T-query defined by an expression of the form {x̄ ∣ ϕ(x̄)}, where x̄ is a tuple of variables, called the distinguished variables of q, and ϕ(x̄) is a FOL formula over the alphabet of T containing all the variables in x̄. The arity of q is the arity of x̄, and we will often use q(x̄) to say that x̄ are the free variables of the FOL query q and write {x̄ ∣ ∃ȳ.ϕ(x̄, ȳ)} simply as ϕ(x̄). Moreover, we will use the predicate ⊤ to form atoms of any arity; such atoms will always be interpreted as true. Given a T-database D and a FOL T-query q of arity n, q^D is the set of all tuples c̄ ∈ C^n such that D ⊨ ϕ(c̄).

A conjunctive query (CQ) q over a schema T is a FOL query of the form {x̄ ∣ ∃ȳ.ϕ(x̄, ȳ)}, where ȳ is a tuple of variables, called the existential variables of q, and ϕ(x̄, ȳ) is a finite conjunction of relational atoms. Given a CQ q = {x̄ ∣ ∃ȳ.ϕ(x̄, ȳ)}, we say that an existential variable y ∈ ȳ is a join existential variable of q if it occurs more than once in the atoms of ϕ(x̄, ȳ). In what follows, we say that a CQ q is a conjunctive query with join-free existential variables (CQJFE) if there is no join existential variable occurring in q.

Other classes of database queries considered in this paper are defined as customary in terms of both syntax and semantics. An atomic query is a FOL query where ϕ(x̄) consists of a single relational atom. A union of conjunctive queries (UCQ) (resp., union of conjunctive queries with join-free existential variables (UCQJFE)) is a query defined as a finite union of CQs (resp., CQJFEs) having the same arity, called its disjuncts, and its semantics is defined via the associated FOL query. For the definition of Datalog, Disjunctive Datalog, and Disjunctive Datalog with inequalities (denoted by DD), we refer the reader to Eiter et al. (1997).

2.2. Querying sets of databases

In what follows, we will often need to extend the notion of database queries to sets of databases. A generalized T-query q of arity n is a function associating to each set Σ of T-databases a finite set of n-tuples of constants in C, called the answers of q for Σ and denoted q^Σ. As customary, for two T-queries q1 and q2, we write q1 ⊑ q2 if q1^Σ ⊆ q2^Σ for each set Σ of T-databases, and we write q1 ≡ q2 if both q1 ⊑ q2 and q2 ⊑ q1.

A common method to define a generalized T-query is to lift the semantics of a T-query to sets of T-databases using the notion of certain answers. Given a T-query q and a set Σ of T-databases, the certain answers of q over Σ are defined as ⋂_{D ∈ Σ} q^D. Thus, in what follows, we consider that every generalized T-query is such that, given a set Σ of T-databases, q^Σ is the set of the certain answers of q over Σ. This small abuse of notation and the observation that q^D = q^Σ when Σ = {D} allow us to blur the distinction between queries and generalized queries. Therefore, from now on, unless otherwise specified, we will use the term T-query for generalized T-query.
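The lifting of a query to sets of databases is just an intersection, which the following minimal sketch makes explicit. It is an illustration under simplifying assumptions: a database is a finite set of facts, a query is any Python function from a database to a set of tuples, and the names `certain_answers`, `q`, `d1` and `d2` are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

Fact = Tuple[str, Tuple[str, ...]]            # (predicate name, arguments)
Database = FrozenSet[Fact]
Query = Callable[[Database], Set[Tuple[str, ...]]]


def certain_answers(query: Query, databases: Iterable[Database]) -> Set[Tuple[str, ...]]:
    """Tuples returned by `query` on every database of a non-empty set Σ."""
    dbs = list(databases)
    if not dbs:
        # The intersection over an empty Σ is not a finite set of tuples.
        raise ValueError("the set of databases must be non-empty")
    answers = set(query(dbs[0]))
    for db in dbs[1:]:
        answers &= query(db)
    return answers


def q(db: Database) -> Set[Tuple[str, ...]]:
    """The CQ {(x) | ∃y. r(x, y)}."""
    return {(args[0],) for pred, args in db if pred == "r"}


d1 = frozenset({("r", ("a", "b")), ("r", ("c", "d"))})
d2 = frozenset({("r", ("a", "e"))})

both = certain_answers(q, [d1, d2])   # only a is an answer in both databases
single = certain_answers(q, [d1])     # Σ = {d1}: coincides with q(d1)
```

The second call illustrates the observation that evaluating over a singleton set gives back the ordinary answers of the query.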

2.3. Data integration

A data integration system (Lenzerini, 2002) J is specified by a triple ⟨G, S, M⟩, where G, the global schema, is a schema over an alphabet AG, S, the source schema, is a schema over an alphabet AS (disjoint from AG, except for the set C), and M is a mapping relating S to G. Specifically, M is a finite set of assertions of the form qS → qG, where qS is an S-query and qG is a G-query of the same arity as qS.

The semantics of J is defined relative to an S-database D, and, intuitively, is the set of all the G-databases that satisfy M with respect to D. A G-database B satisfies M with respect to D, denoted by (D, B) ⊨ M, if it satisfies all the assertions in M, i.e., qS^D ⊆ qG^B for each (qS → qG) ∈ M.

Formally, the semantics of J relative to D, denoted as mod(J, D), is defined as {B ∣ B is a G-database such that (D, B) ⊨ M}. We say that D is consistent with J if mod(J, D) ≠ ∅. The answers to a G-query q w.r.t. a data integration system J = ⟨G, S, M⟩ and an S-database D are simply q^{mod(J,D)}, which we often write simply as q^{J,D}. For two G-queries q1 and q2, we write q1 ⊑J q2 if q1^{J,D} ⊆ q2^{J,D} for each S-database D; strict containment q1 ⊏J q2 and J-equivalence q1 ≡J q2 are defined accordingly.

Specific classes of mappings considered in the literature are GAV, LAV, GLAV, PGAV, and SPGAV. We introduce them under the assumption that the queries appearing in mapping assertions are conjunctive queries or restricted forms thereof.

A GLAV mapping is a set of assertions of the form qS(x̄) → qG(x̄), where both qS and qG are conjunctive queries over S and G respectively, with distinguished variables x̄.

A GAV mapping is a special case of GLAV, constituted by a set of assertions of the form qS(x̄) → A(x̄), where (i) qS is a conjunctive query over S and (ii) A(x̄) is an atomic G-query. A pure GAV mapping (PGAV) is a GAV mapping in which each assertion qS(x̄) → A(x̄) is such that no repeated variables appear in x̄. A PGAV mapping is called SPGAV (PGAV with single assertion per predicate) if it does not contain a pair of assertions with the same predicate symbol in the right-hand side.

A LAV mapping is a special case of GLAV, constituted by a set of assertions of the form A(x̄) → qG(x̄), where (i) A(x̄) is an atomic S-query and (ii) qG is a conjunctive query over G with distinguished variables x̄.

In what follows, we implicitly refer to a data integration system J = ⟨G, S, M⟩, and when we denote a query by qG (resp., qS) we mean that the query is a G-query (resp., S-query).

2.4. The EQL-Lite(UCQ) language

EQL-Lite(UCQ) is a powerful query language in the context of data integration systems, introduced and studied in Calvanese et al. (2007a). An EQL-Lite(UCQ) T-query q is an expression of the form q = {x̄ ∣ φ(x̄)}, where φ(x̄) is an EQL formula built according to the following syntax:

φ(x̄) ::= Kϱ ∣ ∃y.φ ∣ φ1 ∧ φ2 ∣ φ1 ∨ φ2 ∣ ¬φ

with ϱ being a disjunction of conjunctions of relational atoms over T possibly involving existentially quantified variables. The semantics is based on the notion of satisfaction of EQL sentences w.r.t. epistemic interpretations, which are pairs ⟨E, I⟩ with E being a set of interpretations and I ∈ E. We now inductively define when an epistemic interpretation ⟨E, I⟩ satisfies an EQL sentence φ, written E, I ⊨ φ:

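E, I ⊨ P(c̄)       if   I ⊨ P(c̄)
E, I ⊨ φ1 ∧ φ2    if   E, I ⊨ φ1 and E, I ⊨ φ2
E, I ⊨ ¬φ         if   E, I ⊭ φ
E, I ⊨ ∃x.φ       if   E, I ⊨ φ[x/c] for some constant c, where φ[x/c] is φ with x replaced by c
E, I ⊨ Kφ         if   E, I′ ⊨ φ for every I′ ∈ E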

Then, the answers q^{J,D} of an EQL-Lite(UCQ) query q = {x̄ ∣ φ(x̄)} w.r.t. a data integration system J = ⟨G, S, M⟩ and an S-database D are those tuples c̄ of constants such that mod(J, D), B ⊨ φ(c̄) for every B ∈ mod(J, D).

Example 2. Consider Example 1 and suppose we are interested in asking for all x such that there exists y such that we know (x, y) belongs to g1. This can be expressed in EQL-Lite(UCQ) as follows:

qG6 = {(x) ∣ ∃y.K(g1(x, y))}

Note that the query qG6 is different from the query asking for all x such that we know there exists y such that (x, y) belongs to g1, which is expressed as follows:

qG7 = {(x) ∣ K(∃y.g1(x, y))}

Indeed, while it can be verified that the answers to qG6 over J coincide with the answers to the query qS2 = {(x) ∣ ∃y.s2(x, y)}, the answers to qG7 over J coincide with the answers to the query qS5 = qS2 ∪ {(x) ∣ s1(x)}. △

3. Framework

We proceed to introduce the notion of query abstraction following Cima et al. (2019) for the basic definitions. We say that qG is a perfect J-abstraction of qS if qG^{J,D} = qS^D, for each S-database D consistent with J. Clearly, if a perfect J-abstraction of qS exists, then it is unique up to J-equivalence, i.e., if qG and q′ are both perfect J-abstractions of qS, then q′ ≡J qG. Therefore, in the following we will talk about the perfect J-abstraction of qS.

Example 3. Consider Example 1. It is easy to verify that qG1 is the perfect J-abstraction of qS1. △

The following theorem presents a preliminary characterization of the existence of perfect J-abstractions.

Theorem 1. (Cima et al., 2021, Theorem 1) There exists a perfect J-abstraction of qS if and only if, for every pair D, D′ of S-databases, mod(J, D) = mod(J, D′) implies qS^D = qS^D′.

As the condition of being a perfect J-abstraction of a source query is rather strong, it may well be the case that no such global schema query exists.

Example 4. Consider again Example 1. Using Theorem 1, we can show that there exists no perfect J-abstraction for qS4. In fact, for the databases D = {s5(a, b)} and D′ = {s5(a, b), s3(a)}, we have mod(J, D) = mod(J, D′) but D ⊭ qS4 while D′ ⊨ qS4. △
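The argument of Example 4 can be checked mechanically. The sketch below is illustrative only and makes two assumptions beyond the text: the existential in m1 is Skolemized so that canonical global facts can be compared with plain equality, and equality of these canonical facts is used as a sufficient condition for mod(J, D) = mod(J, D′) (with no axioms in G, the models are the G-databases into which the canonical facts can be mapped homomorphically). The names `canonical_facts`, `q_s4`, `D1` and `D2` are hypothetical.

```python
def canonical_facts(src):
    """Global facts forced by m1..m7 of Example 1 on a source database.

    The existential in m1 is Skolemized: g1(x, ("null", x)) stands for
    'g1(x, y) for some unknown y', so results can be compared with ==.
    """
    s1, s2, s3, s4, s5 = (src.get(p, set()) for p in ("s1", "s2", "s3", "s4", "s5"))
    has_s5 = {y for (y, _z) in s5}  # values y such that s5(y, z) holds for some z
    facts = set()
    facts |= {("g1", (x, ("null", x))) for (x,) in s1}          # m1
    facts |= {("g1", t) for t in s2}                            # m2
    facts |= {("g2", t) for t in s3 & s4}                       # m3
    facts |= {("g3", (x, y)) for (x, y) in s2 if (x,) in s3}    # m4
    facts |= {("g3", (x, y)) for (x, y) in s2 if y in has_s5}   # m5
    facts |= {("g4", t) for t in s5}                            # m6
    facts |= {("g5", t) for t in s1 & s4}                       # m7
    return facts


def q_s4(src):
    """Boolean source query {() | ∃x, y. s5(x, y) ∧ s3(x)}."""
    s3 = src.get("s3", set())
    return any((x,) in s3 for (x, _y) in src.get("s5", set()))


D1 = {"s5": {("a", "b")}}                      # D  = {s5(a, b)}
D2 = {"s5": {("a", "b")}, "s3": {("a",)}}      # D′ = {s5(a, b), s3(a)}

same_models = canonical_facts(D1) == canonical_facts(D2)
```

Since s3(a) triggers none of m3 and m4 on its own, both databases force only g4(a, b), while `q_s4` distinguishes them; by Theorem 1 this rules out a perfect abstraction.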

In these cases, it is reasonable to consider weaker notions, such as sound or complete approximations of perfectness. We say that qG is a complete J-abstraction of qS if qS^D ⊆ qG^{J,D}, for each S-database D consistent with J. Similarly, we say that qG is a sound J-abstraction of qS if qG^{J,D} ⊆ qS^D, for each S-database D consistent with J. Obviously, one is interested in complete or sound abstractions that approximate qS at best, at least in the context of a specific class of queries. If LG is a class of queries, we say that a global schema query qG ∈ LG is an LG-minimally complete J-abstraction of qS if qG is a complete J-abstraction of qS and there is no global schema query qG′ ∈ LG such that qG′ is a complete J-abstraction of qS and qG′ ⊏J qG. Similarly, we say that a global schema query qG ∈ LG is an LG-maximally sound J-abstraction of qS if qG is a sound J-abstraction of qS and there is no global schema query qG′ ∈ LG such that qG′ is a sound J-abstraction of qS and qG ⊏J qG′.

Example 5. Consider again Example 1. Queries qG2 and qG3 are, respectively, the UCQ-minimally complete and UCQ-maximally sound J-abstraction of qS3. △

Depending on the chosen language LG, it may be the case that no LG-minimally complete or LG-maximally sound J-abstraction exists (see again Example 1 for some concrete cases). Moreover, even if one such abstraction exists, it may not be unique. For some classes Q of queries, however, one can show that if a Q-maximally sound (resp., Q-minimally complete) J-abstraction of qS exists, then it is unique up to J-equivalence. This is the case, for example, of the class of UCQs: if a UCQ-maximally sound (resp., UCQ-minimally complete) J-abstraction of qS exists, then it is unique up to J-equivalence. Thus, in the following, we simply talk about the UCQ-maximally sound and the UCQ-minimally complete J-abstraction of a source query qS. Other classes of queries with this property will be introduced in the subsequent sections.

In the next sections, we will study J-abstraction for data integration systems of a specific form, namely where (i) the mapping is of type GLAV or special cases of GLAV, and (ii) if not otherwise stated, the set of axioms of both the global schema and the source schema is empty. Also, we will limit our analysis to abstractions of UCQ source queries.

4. View-based query processing and query abstraction

It is well known that there is a relationship between data integration and view-based query processing, grounded on the idea that the sources of a LAV data integration system can be considered as views defined over the global schema, in particular sound views (Lenzerini, 2002). In this section, we take another approach and establish a relationship between GAV data integration systems and views, based on the idea that the elements of the global schema can be considered as views defined over the source schema.

This section is organized as follows. We first recall the basic notions about view-based query processing. Then, in Section 4.1 we make clear the relationship between GAV data integration systems and views, while in Section 4.2 we establish the connection between abstractions and rewriting queries using views. Finally, in Sections 4.3 and 4.4 we use the above connection to introduce results for abstraction and view-based query processing, respectively. All the results presented in this section appear in Cima et al. (2021).

View-based query processing is a general term denoting several tasks related to the presence of views in databases. A set of views V over a schema T is constituted by a finite set of view predicate symbols, where each view symbol V ∈ V has a specific arity and an associated view definition VT, i.e., a query over T of the same arity as V. An extension of a view V is simply a set of facts for V, and a V-extension E is constituted by an extension for each view in V. Given a T-database D, we denote by V(D) the V-extension {V(c̄) ∣ V ∈ V and c̄ ∈ VT^D}. In what follows, we use the term L views to indicate a set of views in which all view definitions are queries expressed in the query language L.

Two particular notions have been subject to extensive investigations in the view-based processing literature, namely view-based query rewriting and view-based query answering (Calvanese et al., 2000, 2007b).

In the former notion, originated in Levy et al. (1995), we are given a query qT over a schema T and a set of views V over T, and the goal is to reformulate qT into a query qV, called a V-rewriting, in terms of the view predicate symbols of V. We obtain different variants of V-rewritings depending on the relationship between qT and qV we aim at. We call qV (i) a V-rewriting of qT under exact views, or simply a V-rewriting of qT, if for every T-database D it holds that qV^{V(D)} ⊆ qT^D, and (ii) an exact V-rewriting of qT if for every T-database D it holds that qV^{V(D)} = qT^D. Note that, if we fix a specific query language LV for expressing V-rewritings, we might lose power in expressing V-rewritings. In this case, a reasonable goal is to compute V-rewritings expressible in LV that are “maximal” in the class LV. Formally, we say that a query qV ∈ LV is an LV-maximal V-rewriting of qT if (i) qV is a V-rewriting of qT, and (ii) there is no q1 ∈ LV such that (a) q1 is a V-rewriting of qT, (b) qV^{V(D)} ⊆ q1^{V(D)} for each T-database D, and (c) there is a T-database D for which qV^{V(D)} ⊊ q1^{V(D)}.

As argued in Nash et al. (2010), given qT and V, the problem of checking whether there exists an exact V-rewriting of qT (called losslessness with respect to rewriting in Calvanese et al., 2007b) is equivalent to the problem, called view determinacy (Nash et al., 2010), of checking whether qT is determined by V, denoted V ↠ qT, i.e., whether V(D1) = V(D2) implies qT^D1 = qT^D2 for each pair of T-databases D1 and D2. Indeed, on the one hand, if V ↠ qT, then the function qV associating to each V(D) the tuples qT^D, for each T-database D, is an exact V-rewriting of qT; on the other hand, if V ↠̸ qT, then such qV is not a function, and hence an exact V-rewriting of qT cannot exist.

In view-based query answering, originated in Duschka and Genesereth (1997), besides qT and V we are also given a V-extension E, and the goal is to compute the so-called certain answers of qT w.r.t. V and E, denoted by cert_{qT,V}^E, which are those tuples of constants c̄ such that c̄ ∈ qT^D for each T-database D satisfying E ⊆ V(D). We denote by cert_{qT,V} the query over V that, for every V-extension E, computes the certain answers of qT w.r.t. V and E, and we call cert_{qT,V} the perfect V-rewriting of qT under sound views, or simply the perfect V-rewriting of qT.

4.1. View-based query processing and data integration

We start by describing how to obtain, from any data integration system J with PGAV mapping, a suitable set of UCQ views2 VJ, and, vice versa, from any set of UCQ views V, a suitable data integration system JV with PGAV mapping.

For a data integration system J = ⟨G, S, M⟩ with M ∈ PGAV, the set of UCQ views VJ is such that (i) the set of view symbols coincides with AG, and (ii) for each view symbol g, the associated view definition gS is the following UCQ over S:

{x̄1 ∣ ∃ȳ1.ϕS^1(x̄1, ȳ1)} ∪ … ∪ {x̄l ∣ ∃ȳl.ϕS^l(x̄l, ȳl)},

where we have one disjunct ∃ȳi.ϕS^i(x̄i, ȳi) for each mapping assertion in M of the form ∃ȳi.ϕS^i(x̄i, ȳi) → g(x̄i). Note that, if M ∈ SPGAV, then all view definitions in VJ are CQs.

Example 6. Let J = ⟨G, S, M⟩ be a data integration system such that M = {m1, m2, m3} with:

m1: ∃y1, y2.s1(y1, x, x) ∧ s2(x, y2, y2) → g1(x)
m2: ∃y1, y2, y3.s1(y1, x1, x2) ∧ s2(x2, y2, y3) → g2(x1, x2)
m3: ∃y1.s3(x1, x2, y1) → g2(x1, x2)

Then, the set of UCQ views VJ over S is VJ = {g1, g2}, where g1S = {(x) ∣ ∃y1, y2.s1(y1, x, x) ∧ s2(x, y2, y2)} and g2S = {(x1, x2) ∣ ∃y1, y2, y3.s1(y1, x1, x2) ∧ s2(x2, y2, y3)} ∪ {(x1, x2) ∣ ∃y1.s3(x1, x2, y1)}. △

For a set of UCQ views V over a schema S, the data integration system JV = ⟨G, S, M⟩ is such that (i) AG coincides with the view predicate symbols in V, (ii) G has no axiom, and (iii) M is defined as follows: for each view symbol V ∈ V and for each CQ {x̄ ∣ ∃ȳ.ϕS(x̄, ȳ)} that is a disjunct in the UCQ VS, the mapping M includes a mapping assertion of the form ∃ȳ.ϕS(x̄, ȳ) → V(x̄). Note that, in general, M ∈ PGAV. However, if V is a set of CQ views, then M ∈ SPGAV.

Example 7. Let V = {V1, V2} be a set of UCQ views over S such that: V1S = {(x1, x2, x3) ∣ s3(x1, x2, x3)} ∪ {(x1, x2, x3) ∣ ∃y.s1(x1, y) ∧ s2(y, x2, x3)} ∪ {(x1, x2, x3) ∣ ∃y1, y2.s1(x1, y1) ∧ s4(y2, x2, x3)} and V2S = {(x1, x2, x3, x4) ∣ ∃y.s2(x1, x2, y) ∧ s4(y, x3, x4)}.

Then, the data integration system is JV = ⟨G, S, M⟩, where AG = {V1, V2} and M = {m1, m2, m3, m4} with:

m1: s3(x1, x2, x3) → V1(x1, x2, x3),
m2: ∃y.s1(x1, y) ∧ s2(y, x2, x3) → V1(x1, x2, x3),
m3: ∃y1, y2.s1(x1, y1) ∧ s4(y2, x2, x3) → V1(x1, x2, x3),
m4: ∃y.s2(x1, x2, y) ∧ s4(y, x3, x4) → V2(x1, x2, x3, x4).
△

For a data integration system J with PGAV mapping and a set of UCQ views V, the pair (J, V) is said to be coherent if (i) the schema over which the set of views V is defined and the source schema of J coincide, and (ii) J = JV or V = VJ. In what follows, when we talk about a coherent pair (J, V), we use S to denote the common schema between J and V.
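The two constructions of this section, VJ from a PGAV mapping and the mapping of JV from a set of UCQ views, amount to regrouping the same assertions. The following sketch shows this on the mapping of Example 6; the data representation (atoms as tuples, existential variables left implicit as the body variables that do not occur in the head) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]                    # (predicate, variables)
Assertion = Tuple[Tuple[Atom, ...], Atom]             # (body over S, head over G)
Disjunct = Tuple[Tuple[str, ...], Tuple[Atom, ...]]   # (distinguished vars, body)


def views_from_mapping(mapping: List[Assertion]) -> Dict[str, List[Disjunct]]:
    """V_J: one view per global predicate, one CQ disjunct per assertion."""
    views = defaultdict(list)
    for body, (pred, head_vars) in mapping:
        views[pred].append((head_vars, body))
    return dict(views)


def mapping_from_views(views: Dict[str, List[Disjunct]]) -> List[Assertion]:
    """Mapping of J_V: one assertion per disjunct of each view definition."""
    return [(body, (view, head_vars))
            for view, disjuncts in views.items()
            for head_vars, body in disjuncts]


def is_spgav(mapping: List[Assertion]) -> bool:
    """True if no two assertions share the predicate of their right-hand side."""
    heads = [head[0] for _body, head in mapping]
    return len(heads) == len(set(heads))


# The mapping M = {m1, m2, m3} of Example 6.
M = [
    ((("s1", ("y1", "x", "x")), ("s2", ("x", "y2", "y2"))), ("g1", ("x",))),
    ((("s1", ("y1", "x1", "x2")), ("s2", ("x2", "y2", "y3"))), ("g2", ("x1", "x2"))),
    ((("s3", ("x1", "x2", "y1")),), ("g2", ("x1", "x2"))),
]

V_J = views_from_mapping(M)
round_trip = mapping_from_views(V_J)
```

In this example g2 receives two disjuncts, so the mapping is PGAV but not SPGAV, and converting the views back returns the original three assertions.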

Based on the relationship between JV and VJ, the following proposition provides a connection between existence of perfect abstractions and existence of exact rewritings.

Proposition 1. [(Cima et al., 2021, Proposition 1)] If (J,V) is a coherent pair and qS is an S-query, then there exists a perfect J-abstraction of qS if and only if there exists an exact V-rewriting of qS.

4.2. Abstractions and rewritings of DD

We now turn our attention to a concrete class of queries, namely DD. From now on, when we use L, we refer to a sublanguage of DD. By exploiting well-known results, we provide connections between the notion of J-abstractions and V-rewritings in the context of DD and its sublanguages. To this end, we first introduce some terminology.

Given a mapping M ∈ PGAV relating S to G and a G-query q in a certain query language L, the M-unfolding of q (Lenzerini, 2002), denoted by unfM(q), is the S-query obtained by replacing each atom α occurring in the expression corresponding to q by the logical disjunction of all the left-hand sides of the mapping assertions in M having the predicate symbol of α in the right-hand side (being careful to use unique variables in place of those variables that appear in the left-hand side of the mapping assertions but not in the right-hand side of those).

Given a set of UCQ views V over S and a V-query q in a certain query language L, the V-expansion of q (Levy et al., 1995), denoted by expV(q), is the S-query obtained by replacing each atom α occurring in the expression corresponding to q by the view definition associated to the view predicate name of α (again, being careful to use unique variables in place of those variables that appear in the bodies of the views but not in their heads).
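As an illustration of the M-unfolding just defined, the following Python sketch unfolds a CQ over G into a UCQ over S. It rests on assumptions that are not part of the formal definition: queries are CQs encoded as (answer variables, atoms), the head of every assertion lists pairwise distinct variables, constants are not handled, and fresh variables are generated by appending a numeric suffix. The function name `unfold` is hypothetical; the mapping is the one of Example 6.

```python
from itertools import count, product

M = [
    ((("s1", ("y1", "x", "x")), ("s2", ("x", "y2", "y2"))), ("g1", ("x",))),
    ((("s1", ("y1", "x1", "x2")), ("s2", ("x2", "y2", "y3"))), ("g2", ("x1", "x2"))),
    ((("s3", ("x1", "x2", "y1")),), ("g2", ("x1", "x2"))),
]

def unfold(query, mapping):
    """Return the unfolding of a CQ as a list of CQs (a UCQ) over S."""
    answer_vars, atoms = query
    fresh = count()
    choices = []
    for pred, args in atoms:
        options = []
        for body, (head_pred, head_vars) in mapping:
            if head_pred != pred:
                continue
            # Head variables are replaced by the terms of the atom; every
            # other body variable is existential and gets a fresh name.
            sub = dict(zip(head_vars, args))
            n = next(fresh)
            options.append(tuple(
                (p, tuple(sub.get(t, f"{t}_{n}") for t in ts))
                for p, ts in body))
        choices.append(options)
    # Distribute the conjunction of atoms over the disjunctions of bodies.
    return [(answer_vars, tuple(a for part in combo for a in part))
            for combo in product(*choices)]

u = unfold((("a", "b"), (("g2", ("a", "b")),)), M)
```

The V-expansion of a query over views can be obtained in the same way by treating each disjunct of a view definition as an assertion, as in the translation from V to JV.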

Proposition 2. [(Cima et al., 2021, Proposition 2)] If (J,V) is a coherent pair, qS is an S-query in L, and q is a query in L, then q is a sound (resp., perfect) J-abstraction of qS if and only if q is a V-rewriting (resp., exact V-rewriting) of qS.

Actually, as shown in Duschka and Genesereth (1998, Lemma 1), if L allows for the union operator, then for any set of UCQ views V over S and any query qS ∈ L over S, if an L-maximal V-rewriting of qS exists, then it is unique up to V-equivalence, and, moreover, it coincides with the perfect V-rewriting of qS.3 From Proposition 2 and the above observation, we can derive the following result.

Corollary 1. [(Cima et al., 2021, Corollary 1)] If (J,V) is a coherent pair and L allows for the union operator, then for every pair of queries qS, q ∈ L, we have that q is the L-maximally sound J-abstraction of qS if and only if q is the perfect V-rewriting of qS.

By exploiting the relationships established above, we are now ready to investigate how results and techniques from the view-based query processing literature can be directly translated into results and techniques in the context of abstraction, and vice versa.

4.3. From view-based query processing to abstraction

By combining Proposition 1 with a well-known undecidability result about view determinacy, we can derive a negative result about an arguably fundamental problem for the notion of abstraction, namely the existence problem (with no restrictions on the query language to express perfect abstractions) of perfect abstractions, even in very restricted settings.

Theorem 2. [(Cima et al., 2021, Theorem 2)] Given a data integration system J = ⟨G, S, M⟩ with M ∈ SPGAV and a CQ S-query qS, checking whether there exists a perfect J-abstraction of qS is undecidable.

By exploiting Corollary 1, we now illustrate how to use off-the-shelf algorithms for rewriting queries in the presence of views as algorithms for computing abstractions. By results of Levy et al. (1995), for CQ views V, perfect V-rewritings of UCQs qS can always be expressed as UCQs, and can always be computed [e.g., by means of the bucket algorithm (Levy et al., 1996) or the MiniCon algorithm (Pottinger and Halevy, 2001)]. Thus Corollary 1 implies that, given a data integration system J = ⟨G, S, M⟩ with M ∈ SPGAV and a UCQ S-query qS, we can compute the UCQ-maximally sound J-abstraction of qS as follows: (i) compute VJ, and (ii) compute and return the UCQ corresponding to the perfect VJ-rewriting of qS.

Corollary 2. [(Cima et al., 2021, Corollary 2)] If J is a data integration system with SPGAV mapping and qS is a UCQ S-query, then the UCQ-maximally sound J-abstraction of qS exists and is computable.

Things get more complicated when we consider a data integration system J with PGAV mappings, which are clearly more expressive than SPGAV, for which VJ is a set of UCQ views, rather than CQ views. Indeed, for UCQ views V, UCQ-maximal V-rewritings of CQs qS are not guaranteed to exist (Duschka and Genesereth, 1998; Afrati and Chirkova, 2019), and thus, in general, perfect V-rewritings of CQs qS are not expressible as UCQs. However, the perfect V-rewritings of UCQs (actually, even of Datalog queries) qS can always be expressed in DD, and can always be computed using the technique presented in Duschka and Genesereth (1998). Thus, Corollary 1 implies that, given a data integration system J = ⟨G, S, M⟩ with M ∈ PGAV and a UCQ S-query qS, we can compute the DD-maximally sound J-abstraction of qS as follows: (i) compute VJ, and (ii) compute and return the DD query corresponding to the perfect VJ-rewriting of qS.

Corollary 3. [(Cima et al., 2021, Corollary 3)] If J is a data integration system with PGAV mapping and qS is a UCQ S-query, then the DD-maximally sound J-abstraction of qS exists and is computable.

4.4. From abstraction to view-based query processing

As already observed, Duschka and Genesereth (1998) and Afrati and Chirkova (2019) show that for a given set V of UCQ views, UCQ-maximal V-rewritings of CQs may not exist. Combined with an observation made above, this means that perfect V-rewritings of CQs are in general not expressible as UCQs. We point out that the CQs qS used to prove such results contain more than one join existential variable. As a consequence, in the case of UCQ views V, it is still open whether (i) the result holds even for qS with just one join existential variable, and (ii) perfect V-rewritings of UCQJFEs are expressible as UCQs. By combining Corollary 1 with results of Cima et al. (2019) (that we will discuss in Section 5), we can actually answer both questions positively.

Corollary 4. [(Cima et al., 2021, Corollary 4)] For a set V of UCQ views, the UCQ-maximal V-rewritings of qS may not exist, even if qS is a CQ with one join existential variable.

On the other hand, in Section 5, we will show that for a data integration system J with PGAV mapping, UCQ-maximally sound J-abstractions of UCQJFEs are guaranteed to exist, and we will provide an algorithm to compute them (Theorem 5). Thus, given a set of UCQ views V over a schema S and a UCQJFE S-query qS, we can compute the perfect V-rewriting of qS as follows: (i) compute JV, and (ii) compute and return the UCQ-maximally sound JV-abstraction of qS. This leads to the following positive result for V-rewritings of UCQJFEs.

Corollary 5. [(Cima et al., 2021, Corollary 5)] If V is a set of UCQ views and qS is a UCQJFE S-query, then the perfect V-rewriting of qS is computable and can be expressed as a UCQ.

5. UCQ abstractions

In this section we investigate the problem of checking the existence of abstractions in the class UCQ, and of their computation. We first study the case of UCQ-minimally complete J-abstractions, then we switch to UCQ-maximally sound J-abstractions, and finally we tackle perfect J-abstractions in the class UCQ. We observe that all the results presented in this section appear in Cima et al. (2019).

On the positive side, we show that UCQ-minimally complete abstractions always exist, by providing an algorithm to compute them. In a nutshell, given a data integration system J = ⟨G, S, M⟩ and a UCQ qS = qS1 ∪ … ∪ qSn, an algorithm to compute the UCQ-minimally complete J-abstraction of qS returns the union of the CQs of the form {x̄i | ∃Ȳi.M(qSi)(x̄i)} obtained by simply “applying” the mapping M to each CQ qSi in qS, using ⊤ to bind the distinguished variables that are not involved in the application of M to qSi. Formally, applying the GLAV mapping M to a CQ q means chasing (Fagin et al., 2005) the atoms in q by using the tuple generating dependencies corresponding to the assertions in M.
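The step of applying M to a single CQ can be sketched as one chase round over the atoms of the query, viewed as facts. The Python code below is a minimal sketch under several assumptions: the mapping is a small hypothetical GLAV mapping chosen for illustration, assertions are encoded as (body atoms, head atoms) with head-only variables read as existential, constants are not handled, and the use of ⊤ for distinguished variables that are not touched by the mapping is omitted. The names `homomorphisms` and `apply_mapping` are hypothetical.

```python
from itertools import count

def homomorphisms(atoms, facts, sub=None):
    """Yield every substitution that maps all `atoms` into `facts`."""
    sub = sub or {}
    if not atoms:
        yield sub
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        new = dict(sub)
        if all(new.setdefault(a, f) == f for a, f in zip(args, fargs)):
            yield from homomorphisms(rest, facts, new)

def apply_mapping(cq, mapping):
    """Chase the atoms of `cq` with the assertions of the mapping (one round)."""
    answer_vars, atoms = cq
    fresh = count()
    derived = []
    for body, head in mapping:
        body_vars = {t for _, ts in body for t in ts}
        for h in homomorphisms(list(body), atoms):
            n = next(fresh)  # fresh existential variables per trigger
            for pred, ts in head:
                atom = (pred, tuple(h[t] if t in body_vars else f"{t}_z{n}"
                                    for t in ts))
                if atom not in derived:
                    derived.append(atom)
    return answer_vars, tuple(derived)

# Hypothetical mapping used only for illustration.
M = [
    ([("s1", ("x",)), ("s2", ("x", "y"))], [("g1", ("x", "x"))]),
    ([("s1", ("x",)), ("s3", ("x", "y"))], [("g1", ("x", "y"))]),
    ([("s4", ("x",))], [("g1", ("x", "y"))]),
]
q1 = (("x",), (("s1", ("x",)), ("s3", ("x", "y"))))
q2 = (("x",), (("s4", ("x",)),))
abs1 = apply_mapping(q1, M)
abs2 = apply_mapping(q2, M)
```

A single round suffices here because the assertions go from S to G and G is assumed to have no axioms; for a UCQ, the function would be applied to each disjunct and the results collected in a union.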

Theorem 3. [(Cima et al., 2019, Theorem 13)] The UCQ-minimally complete J-abstraction of qS always exists and is computable.

On the negative side, the following shows that UCQ-maximally sound abstractions may not exist.

Theorem 4. [(Cima et al., 2019, Theorem 16)] The UCQ-maximally sound J-abstractions of qS may not exist if at least one of the following is true:

(a) qS contains a join existential variable;

(b) M contains a LAV mapping assertion;

(c) M contains a non-PGAV mapping assertion.

Interestingly, in order to illustrate case (a) of the above theorem we can refer to a slight modification of the data integration system J introduced in Example 1. In particular, let J1 = ⟨G, S, M1⟩ be obtained from J by removing from M the mapping m1, and consider the query qS4 of Example 1. Note that M1 ∈ PGAV and qS4 contains a join existential variable, x. Clearly, removing m1 has no impact on the abstraction of qS4. Thus, as already discussed in Example 1, there exists no UCQ-maximally sound J1-abstraction of qS4.

Motivated by Theorem 4, we next introduce a specific scenario, that we call restricted, obtained from the general one by limiting the mapping language to PGAV, and qS to be UCQJFEs. It can be shown that for such a restricted scenario, UCQ-maximally sound abstractions always exist. Intuitively, the latter can be derived by showing that for any UCQJFE qS and data integration system J = ⟨G, S, M⟩ with M ∈ PGAV, a CQ-maximally sound J-abstraction of qS may comprise at most kqS,M atoms, where kqS,M is an integer that depends on the number of atoms occurring in qS and the number of mapping assertions occurring in M. Hence, given a data integration system J with PGAV mapping and a UCQJFE qS, an algorithm to compute the UCQ-maximally sound J-abstraction of qS simply returns the union of all CQs qG comprising at most kqS,M atoms that are sound J-abstractions of qS. The crucial observation here is that in order to check whether qG is a sound J-abstraction of qS, it is sufficient to check whether unfM(qG) ⊑ qS, which is decidable, since both qS and unfM(qG) are UCQs (Sagiv and Yannakakis, 1980).
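The containment test at the heart of this soundness check can be sketched with the classical homomorphism criterion: a CQ q1 is contained in a CQ q2 if and only if q2 can be mapped homomorphically into q1 while preserving the answer tuple, and a UCQ is contained in another UCQ if and only if each of its disjuncts is contained in some disjunct of the other. The Python code below is a minimal sketch under the assumptions that queries are encoded as (answer variables, atoms), contain no constants, and that the unfolding of the candidate qG has already been computed; the example queries are hypothetical.

```python
def homomorphisms(atoms, facts, sub=None):
    """Yield every extension of `sub` that maps all `atoms` into `facts`."""
    sub = sub or {}
    if not atoms:
        yield sub
        return
    (pred, args), rest = atoms[0], atoms[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        new = dict(sub)
        if all(new.setdefault(a, f) == f for a, f in zip(args, fargs)):
            yield from homomorphisms(rest, facts, new)

def cq_contained(q1, q2):
    """q1 is contained in q2 iff q2 maps into q1, answer tuple preserved."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    start = {}
    for v2, v1 in zip(head2, head1):
        if start.setdefault(v2, v1) != v1:
            return False
    return any(True for _ in homomorphisms(list(body2), body1, start))

def ucq_contained(u1, u2):
    """Each disjunct of u1 must be contained in some disjunct of u2."""
    return all(any(cq_contained(c1, c2) for c2 in u2) for c1 in u1)

# Hypothetical unfolding of a candidate abstraction, and two source queries.
unf_qG = [
    (("a", "b"), (("s1", ("y", "a", "b")), ("s2", ("b", "z", "w")))),
    (("a", "b"), (("s3", ("a", "b", "y")),)),
]
qS_wide = [(("u", "v"), (("s1", ("p", "u", "v")),)),
           (("u", "v"), (("s3", ("u", "v", "r")),))]
qS_narrow = [(("u", "v"), (("s1", ("p", "u", "v")),))]
sound_for_wide = ucq_contained(unf_qG, qS_wide)
sound_for_narrow = ucq_contained(unf_qG, qS_narrow)
```

In this toy instance the candidate is sound for the first source query but not for the second, because the disjunct over s3 has no counterpart in it.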

Theorem 5. [(Cima et al., 2019, Theorem 21)] In the restricted scenario, the UCQ-maximally sound J-abstraction of qS always exists and is computable.

To conclude the section, we provide the last positive result about perfect abstractions in the class UCQ. Namely, we show that checking whether there exists a UCQ that is the perfect J-abstraction of qS is decidable. In particular, given a data integration system J with GLAV mapping and a UCQ qS, an algorithm to decide whether there exists a UCQ that is a perfect J-abstraction of qS proceeds as follows. First, it computes the query qG that is the UCQ-minimally complete J-abstraction of qS. Then, it checks whether qG is a sound abstraction of qS (as discussed above). If the answer is negative, then there exists no UCQ that is a perfect J-abstraction of qS. If the answer is positive, then qG is actually a UCQ, and is the perfect J-abstraction of qS. Thus the algorithm also solves the computation problem for perfect abstractions in the UCQ language.

Theorem 6. [Cima et al. (2019)] Checking whether there exists a query q in the class UCQ that is the perfect J-abstraction of qS is decidable. Moreover, there is an algorithm that computes q, whenever it exists.

6. Monotone abstractions

The notion of monotonicity defines a very natural class of queries that is popular in the field of databases and knowledge representation alike. The intuition behind monotone queries is simple: a query q is monotone if, whenever the data we possess increases, the answers for q do not decrease. In the literature, however, this notion has been formalized in two distinct ways. In the context of databases, a T-query q is monotone if, for every pair of T-databases D, D′ such that D ⊆ D′, we have qD ⊆ qD′. Even very simple FOL queries can be shown not to be monotone under this notion. On the other hand, in the context of mathematical logic, the notion of monotonicity comes in a different flavor: a T-query q is monotone if, for every pair of sets of interpretations Σ, Σ′ for T such that Σ ⊆ Σ′, we have qΣ ⊇ qΣ′, i.e., fewer admissible interpretations mean more knowledge and hence no fewer answers. We observe here that, under the semantics of certain answers, FOL queries are monotone in this sense.
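The database notion of monotonicity can be illustrated on a toy instance. The two queries and the databases in the Python sketch below are hypothetical: the first query uses only conjunction and existential quantification, while the second uses negation and loses an answer when the database grows.

```python
def q_join(db):
    """{x | exists y. r(x, y) and a(y)}: a conjunctive query."""
    return {x for (x, y) in db["r"] if y in db["a"]}

def q_diff(db):
    """{x | a(x) and not b(x)}: a simple FOL query with negation."""
    return db["a"] - db["b"]

# D is contained in D_bigger: every relation of D is a subset of its counterpart.
D = {"r": {(1, 2)}, "a": {2}, "b": set()}
D_bigger = {"r": {(1, 2), (3, 2)}, "a": {2}, "b": {2}}

join_grows = q_join(D) <= q_join(D_bigger)
diff_grows = q_diff(D) <= q_diff(D_bigger)
```

Here `join_grows` holds while `diff_grows` does not, since the element 2 is an answer of the second query over D but not over the larger database.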

To define the notion of monotone queries in the context of a data integration system, we use the notion of monotonicity from logic. A G-query q is monotone in the context of a data integration system J = ⟨G, S, M⟩ if for every pair D, D′ of S-databases, mod(J,D) ⊆ mod(J,D′) implies qJ,D′ ⊆ qJ,D. In the following, we use 𝔐J to denote the class of monotone queries in the context of J = ⟨G, S, M⟩, and when J is understood, we simply use 𝔐.

This notion of monotonicity is natural yet broad enough to characterize some of the most popular classes of queries. For example, it is trivial to see that queries evaluated under certain answer semantics are monotone. In the light of this consideration, it is natural to ask whether perfect and approximated abstractions in the class of monotone queries always exist for a given class of source queries and whether they can be computed. Moreover, one can show that, whenever an 𝔐-maximally sound (resp., 𝔐-minimally complete) J-abstraction exists, then it is unique. Therefore, from now on, given a source query qS, we will talk about the 𝔐-maximally sound (resp., the 𝔐-minimally complete) J-abstraction of qS.

In the remainder of this section, we survey recent results on monotone abstractions of UCQs presented in Cima et al. (2022). We introduce a language of monotone queries, called DDK, with attractive computational properties (Section 6.1). For the case of data integration systems with no axioms in both the global schema and in the source schema, we show that minimally complete and maximally sound monotone abstractions for UCQ source queries always exist, and are expressible in DDK. From these results, we also derive the decidability of checking whether a perfect monotone abstraction of a given source query exists (Section 6.2).

6.1. A language for monotone abstractions

Monotone queries form a natural yet expressive class of queries. Unsurprisingly, perfect and approximated monotone abstractions require a suitably expressive query language. We now introduce one such language and discuss some of its most compelling computational characteristics. The language, called DDK, is based on disjunctive Datalog, extended with an epistemic operator. We present it in a form specifically tailored for querying data integration systems.

Assume a data integration system J = ⟨G, S, M⟩ and an alphabet of predicate symbols Int, called intensional predicate symbols, disjoint from the alphabets of G and S. We now consider the case where the logical theories corresponding to both G and S may have a nonempty set of axioms.

A DDK query for J includes a set of rules, each one of two possible forms:

• the typical form of disjunctive Datalog, i.e.,

b1 ∧ … ∧ bm → i1 ∨ … ∨ in    (1)

where b1, …, bm and i1, …, in are atoms on intensional predicates, and

• a new form specified as follows

K(ϕ1(x̄) ∨ … ∨ ϕm(x̄)) → ⋁i∈{1..n} ∃ȳi.ψi(x̄,ȳi)    (2)

where each ψi is a conjunction of atoms over Int, and each ϕi is of the form ∃z̄.γ(x̄,z̄) ∧ ξ(x̄), with γ(x̄,z̄) a conjunction of atoms over G, and ξ(x̄) a conjunction of inequalities over variables in x̄ only.

An n-ary DDK query q for J is a pair q = ⟨Ans, R⟩ where R is a finite set of DDK rules, called the definition of q, and Ans is an n-ary intensional predicate in Int, called the answer predicate of q.

Answers for DDK queries are defined based on the notions presented in Calvanese et al. (2007a). An interpretation for q is a pair I = (E, f), where E is a set of interpretations for J, and f is an interpretation for Int with domain C. An interpretation I = (E, f) satisfies a DDK rule ρ of q (written I⊧ρ) if the following conditions hold:

• If ρ is a formula of the form (1), then I⊧ρ if f⊧ρ, i.e., f satisfies the implication in (1).

• If ρ is a formula of the form (2), then I⊧ρ if for all tuples c̄ of values in C, if I satisfies the epistemic formula K(⋁i ϕi(c̄)), then there is j such that ∃ȳj.ψj(c̄,ȳj) is true in f.

An interpretation I for q is called a model of q if all the rules in the definition of q are satisfied by I. It should be clear that, under this definition of semantics, K represents the “knowledge” operator of the modal logic system S5. In other words, the formula Kα should be read as “α is known (i.e., logically implied) by the system”.

We are ready to define the answer qJ,D of a DDK query q = ⟨Ans, R⟩ with respect to J and the S-database D. Specifically, qJ,D = {c̄ ∈ Ansf | (mod(J,D), f) is a model of q}.

While a thorough analysis of DDK is outside the scope of the present work, we mention some of its most appealing characteristics. Firstly, we observe that DDK generalizes UCQs. In particular, every UCQ q of m disjuncts is equivalent to a DDK query with one rule of the form (2) where the disjuncts of q are in the scope of K. Secondly, every DDK query q over J is monotone in the context of J. Intuitively, monotonicity follows from a simple form of stratification where certain answers to UCQs (rules of the form (2)) and recursive computations (rules of the form (1)) never mix. In turn, this simple form of stratification guarantees that answering q over J boils down to the following: (i) computing certain answers for the UCQs in the scope of K in the left-hand side of the rules of the form (2) in q, and (ii) computing the answers for the remaining rules (form (1)) over the result of the previous step. Monotonicity follows from the monotonicity of certain answers to UCQs, and from the fact that the rules of the form (1) define a monotone query. These considerations indicate a third appealing characteristic of DDK. Specifically, the decidability of answering a DDK query q w.r.t. J and D depends exclusively on the decidability of answering UCQs over J, as the following proposition shows.

Proposition 3. [(Cima et al., 2022, Proposition 2)] Answering DDK queries w.r.t. J and D is decidable if and only if computing the certain answers of UCQs w.r.t. J and D is decidable.

These results sharply contrast with similar results obtained for plain (non-disjunctive) Datalog. In particular, the undecidability of the latter can be proved even in the case of global schema axioms expressed in very simple Description Logics of the DL-Lite family (see, e.g., Levy and Rousset, 1998; Calvanese and Rosati, 2003).

6.2. Monotone abstractions via DDK

We now turn our attention to monotone abstractions expressed in DDK. We start by observing that, in terms of computational complexity, DDK perfectly fits the problem of computing approximated abstractions, as the following proposition shows.

Proposition 4. [(Cima et al., 2022, Proposition 3)] There exists a data integration system J with PGAV mapping and a UCQ qS such that answering the 𝔐-maximally sound J-abstraction of qS is coNP-hard in data complexity.

In the remainder of this section, we show that DDK is well-suited to express monotone abstractions, both perfect and approximated. In discussing this issue, we go back to our assumption of dealing with data integration systems with no axioms in both the global and the source schema. So, in what follows, we implicitly deal with a data integration system J = ⟨G, S, M⟩, where G and S have no axioms, and a UCQ S-query qS = q1 ∪ … ∪ qn, where qi = {x̄ | ∃ȳi.ϕi(x̄,ȳi)}, for i = 1, …, n.

6.2.1. 𝔐-maximally sound abstractions

In Cima et al. (2022), it is shown that DDK can always express 𝔐-maximally sound J-abstractions of UCQs, by illustrating a technique that, given a query qS, builds a set RJ of DDK rules whose intensional predicates are the predicates in S, and then uses such rules to construct the 𝔐-maximally sound J-abstraction of qS as a DDK query. We do not describe the technique in detail here. Rather, we use an example to give an intuition of the construction.

Example 8. Given the following mapping in J:

m1: ∃y.s1(x) ∧ s2(x,y) → g1(x,x)
m2: s1(x) ∧ s3(x,y) → g1(x,y)
m3: s4(x) → ∃y.g1(x,y)

RJ is the following set of DDK rules:

K(g1(x,x)) → (∃y.s1(x) ∧ s2(x,y)) ∨ (s1(x) ∧ s3(x,x))
K(g1(x,y) ∧ x ≠ y) → s1(x) ∧ s3(x,y)
K(∃y.g1(x,y)) → s4(x) ∨ (∃y.s1(x) ∧ s3(x,y)) ∨ (∃y.s1(x) ∧ s2(x,y))

Intuitively, the rules of RJ specify, for the various facts over G that are certain, i.e., that are known to hold, the queries over the sources that generate them. For example, the first rule of RJ specifies that, if a constant is known to satisfy g1(x, x), then this knowledge derives either from the answers to the source query {x|∃y.s1(x)∧s2(x, y)} or from the answers to the source query {x|s1(x)∧s3(x, x)}. As another example, the second rule of RJ specifies that the pairs of distinct constants x, y known to satisfy g1(x, y) derive from the query {x, y|s1(x)∧s3(x, y)}. It can be shown that this is crucial for ensuring that the abstraction of queries involving the join of s1 and s3, which is based on the certain answers of g1, does not include data deriving from source queries whose abstraction is based on the certain answers of the projection of g1. Finally, the third rule of RJ takes care of those constants x known to satisfy g1(x, y), for some, not necessarily known, y. Such constants may derive from each of the source queries above.

Using the notion of RJ, we can immediately obtain the 𝔐-maximally sound J-abstraction of qS, by adding to RJ the set A constituted by one rule of the form ϕi(x̄,ȳi) → Ans(x̄) for each disjunct qi = {x̄ | ∃ȳi.ϕi(x̄,ȳi)} in qS.

Proposition 5. [(Cima et al., 2022, Theorem 2)] The DDK query ⟨Ans, RJ ∪ A⟩ is the 𝔐-maximally sound J-abstraction of qS.

In the light of Proposition 5 and from the existence of an algorithm to compute RJ ∪ A, we obtain the following.

Theorem 7. [(Cima et al., 2022, Theorem 2)] The 𝔐-maximally sound J-abstraction of qS always exists, is computable, and can be expressed in DDK.

6.2.2. 𝔐-minimally complete abstractions

We show that DDK can always express 𝔐-minimally complete J-abstractions of UCQs.

Let us first introduce a useful notion. Given a CQ q = {x̄ | ∃ȳ.ϕ(x̄,ȳ)}, Saturate(q) denotes the UCQ with inequalities obtained as follows. For each possible unifier μ on the variables in x̄ ∪ ȳ such that μ(x) ∈ x̄ for each x ∈ x̄, Saturate(q) contains a query obtained from μ(q) by adding an inequality atom (t1 ≠ t2) for each pair of distinct variables t1, t2 occurring in μ(q). For a UCQ Q, we denote by Saturate(Q) the UCQ with inequalities consisting of the union of Saturate(q), for each disjunct q of Q. It is easy to see that Saturate(Q) is equivalent to Q, for every UCQ Q.
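A possible way to enumerate the disjuncts of Saturate(q) is to go through the set partitions of the variables of q: each partition corresponds to a unifier that identifies exactly the variables in the same block. The Python sketch below follows this idea under stated assumptions: a CQ is encoded as (answer variables, atoms) without constants, one representative unifier is generated per partition (unifiers that differ only in the choice of representatives yield the same query up to renaming), and an answer variable is chosen as representative whenever the block contains one, so that answer variables are mapped to answer variables. The function names are hypothetical.

```python
from itertools import combinations

def partitions(items):
    """Yield all set partitions of the list `items`."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

def saturate(cq):
    """Return Saturate(q) as a list of (head, atoms, inequalities)."""
    answer_vars, atoms = cq
    variables = list(dict.fromkeys(
        list(answer_vars) + [t for _, ts in atoms for t in ts]))
    result = []
    for part in partitions(variables):
        mu = {}
        for block in part:
            # Prefer an answer variable as the representative of the block.
            rep = next((v for v in block if v in answer_vars), block[0])
            mu.update({v: rep for v in block})
        new_atoms = tuple(dict.fromkeys(
            (p, tuple(mu[t] for t in ts)) for p, ts in atoms))
        head = tuple(mu[v] for v in answer_vars)
        # One inequality for each pair of distinct remaining variables.
        inequalities = tuple(combinations(sorted(set(mu.values())), 2))
        result.append((head, new_atoms, inequalities))
    return result

sat = saturate((("x",), (("s", ("x", "y")),)))
```

For the query {x | ∃y.s(x, y)} the sketch produces two disjuncts, s(x, x) with no inequality and s(x, y) with x ≠ y, whose union is equivalent to the original query.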

Consider a disjunct qh in Saturate(qS). Clearly, qh is a CQ with inequalities of the form qh = {x̄ | ∃ȳ.ϕ(x̄,ȳ) ∧ χ(x̄,ȳ)}, where χ(x̄,ȳ) are inequality atoms. Let M(qh) denote the result of chasing the set of relational atoms occurring in qh with M. Let ρqh denote the DDK rule K(M(qh)(x̄) ∧ χ(x̄,ȳ)) → Ans(x̄). Finally, let qc denote the DDK query consisting of all the rules ρqh for the various qh in Saturate(qS) and with answer predicate Ans. We can now prove the following.

Proposition 6. [(Cima et al., 2022, Theorem 1)] qc is the 𝔐-minimally complete J-abstraction of qS.

The following statement is a straightforward consequence of Proposition 6.

Theorem 8. [(Cima et al., 2022, Theorem 1)] The 𝔐-minimally complete J-abstraction of qS always exists, is computable, and can be expressed in DDK.

6.2.3. Perfect monotone abstractions

From the results presented above, we can derive an algorithm for checking whether there exists a query in 𝔐 that is the perfect J-abstraction of qS. In particular, observe that if the perfect J-abstraction of qS can be expressed as a query in 𝔐, then it is J-equivalent to the 𝔐-minimally complete J-abstraction of qS. Then, from Proposition 6 we know that, in order to check whether there exists a query in 𝔐 that is the perfect J-abstraction of qS, we have to check whether qS is equivalent to qc modulo J.

To this end, we observe the following. There exists a UCQ with inequalities S-query qmin such that qminD = qcJ,D, for every S-database D. Moreover, qmin is computable. These two properties result from J being a GLAV data integration system with no source and global schema axioms, and from the specific form of qc. Therefore, in order to check whether there exists a query in 𝔐 that is the perfect J-abstraction of qS, we just need to check whether qmin ≡ qS. The next claim follows from these considerations.

Theorem 9. [(Cima et al., 2022, Theorem 3)] Checking whether there exists a query q in the class 𝔐 that is the perfect J-abstraction of qS is decidable. Moreover, there is an algorithm that computes q, whenever it exists.

7. Non-monotone abstractions

So far, we have limited our analysis of the abstraction reasoning task by focusing on monotone query languages in the context of data integration systems. There exist, however, very simple scenarios in which the perfect abstraction can only be expressed by means of a non-monotone query.

Example 9. Let J = ⟨G, S, M⟩ be such that the global schema G has the predicates {A/1, B/1, C/1}, the source schema S has the predicates {s1/1, s2/1}, and M = {m1, m2, m3, m4}, where:

m1: s1(x) → A(x)
m2: s2(x) → A(x)
m3: s2(x) → B(x)
m4: s1(x) ∧ s2(x) → C(x)

Consider the query qS = {x | s1(x)}. One can verify that the perfect J-abstraction of qS is the non-monotone query qG that, given an S-database D, returns those x for which either (A(x)∧¬B(x)) or C(x) is known to be true, i.e., holds in every G-database B such that B ∈ mod(J,D).
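The claim of Example 9 can be checked exhaustively on small source databases. The Python sketch below relies on two assumptions that are made explicit here: (i) since G has no axioms and the mapping is sound, an atom over G is known exactly when the mapping forces it, so the known facts are computed directly from D; and (ii) the condition on A and B is read as "A(x) is known and B(x) is not known", i.e., in the same shape as the query {x | K(A(x)) ∧ ¬K(B(x))} used in Section 7.2. The function names and the three-element domain are hypothetical.

```python
from itertools import combinations

def known_facts(db):
    """Atoms over G forced by m1..m4 on the source database `db`."""
    s1, s2 = db["s1"], db["s2"]
    return {"A": s1 | s2, "B": set(s2), "C": s1 & s2}

def q_source(db):
    """qS = {x | s1(x)} evaluated over the sources."""
    return set(db["s1"])

def q_abstraction(db):
    """x such that (A(x) known and B(x) not known) or C(x) known."""
    known = known_facts(db)
    return (known["A"] - known["B"]) | known["C"]

def subsets(domain):
    items = sorted(domain)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

# Compare the two queries on every source database over a 3-element domain.
mismatches = [(s1, s2)
              for s1 in subsets({1, 2, 3}) for s2 in subsets({1, 2, 3})
              if q_source({"s1": s1, "s2": s2}) != q_abstraction({"s1": s1, "s2": s2})]
```

Under these assumptions the list of mismatches is empty: the elements of s1 that are not in s2 are recovered through A and B, and those that are also in s2 are recovered through C.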

Motivated by the above example, in this section we summarize the most salient aspects of the results in Cima et al. (2020), which investigates the problem of finding perfect (resp. minimally complete, maximally sound) abstractions expressed in the query language EQL-Lite(UCQ).4 For instance, refer to Example 9. The perfect J-abstraction of qS written there in natural language can be formulated through the EQL-Lite(UCQ) query qG = {(x) | K(A(x) ∧ ¬B(x)) ∨ K(C(x))}. As in the case of the UCQ and the 𝔐 classes, it can be shown that if an EQL-Lite(UCQ)-maximally sound (resp., EQL-Lite(UCQ)-minimally complete) J-abstraction of qS exists, then it is unique up to J-equivalence. Thus, in what follows, we will simply talk about the EQL-Lite(UCQ)-maximally sound (resp., EQL-Lite(UCQ)-minimally complete) J-abstraction of qS.

A natural question that arises is whether “best” abstractions in the EQL-Lite(UCQ) query language always exist. Unfortunately, the following theorem shows that this is not the case for both EQL-Lite(UCQ)-minimally complete abstractions and EQL-Lite(UCQ)-maximally sound abstractions.

Theorem 10. [(Cima et al., 2020, Theorems 1 and 2)] Both the EQL-Lite(UCQ)-minimally complete J-abstractions of qS and the EQL-Lite(UCQ)-maximally sound J-abstractions of qS may not exist.

Due to the above negative result, which holds already for CQJFE queries qS and data integration systems with PGAV mappings, we now explore two alternative restricted scenarios. The former weakens the target query language for expressing abstractions by considering a fragment of EQL-Lite(UCQ), whereas the latter weakens the mapping language by considering a special case of GLAV. In both the restricted scenarios, we assume that source queries are CQs rather than UCQs.

7.1. A restricted non-monotone query language

We now consider the problem of finding abstractions expressed in a restricted version of EQL-Lite(UCQ), namely the fragment of EQL-Lite(UCQ) where both nested negation and union operators are disallowed; in what follows, we refer to it as restricted EQL-Lite(UCQ). More formally, a restricted EQL-Lite(UCQ) query q is an expression of the form q = {x̄ | φ(x̄)} where φ(x̄) is an EQL formula built according to the following syntax:

φ ::= Kϱ | ∃y.φ | φ1 ∧ φ2 | ¬δ
δ ::= Kϱ | ∃y.δ

with ϱ being a disjunction of conjunctions of atoms over G possibly involving existentially quantified variables. For instance, the EQL-Lite(UCQ) query qG illustrated above, which corresponds to the perfect J-abstraction of qS in Example 9, is not a restricted EQL-Lite(UCQ) query, since it makes use of the union operator.

On the negative side, even in this scenario, maximally sound abstractions are not guaranteed to exist, and this holds already for CQJFE queries qS and data integration systems with PGAV mappings.

Theorem 11. [(Cima et al., 2020, Theorem 2)] The restricted EQL-Lite(UCQ)-maximally sound J-abstractions of qS may not exist.

On the positive side, we now provide an algorithm for computing restricted EQL-Lite(UCQ)-minimally complete J-abstractions of CQs qS. The algorithm is similar to the one for the UCQ case (cf. Section 5), except that all the atoms obtained when applying the mapping to the given CQ occur inside the scope of the epistemic operator K, binding also the existential variables coming from the input query. More precisely, given a data integration system J = ⟨G, S, M⟩ and a CQ qS = {x̄ | ∃ȳ.ϕ(x̄,ȳ)}, the algorithm returns the restricted EQL-Lite(UCQ) query qG = {x̄ | ∃Ȳ.K(∃z̄.M(qS)(x̄))}, where Ȳ ⊆ ȳ are the existential variables of qS occurring in M(qS), while z̄ are the fresh existential variables introduced when applying M to qS. To see the difference with the UCQ case, recall Example 1 in the introduction and the CQ qS2 therein. While qG2 is the UCQ-minimally complete J-abstraction of qS2, the restricted EQL-Lite(UCQ) query {x∣∃y.K(g1(x, y))} returned by the above algorithm is a better complete approximation than qG2, and is in fact the perfect J-abstraction of qS2.

Theorem 12. [(Cima et al., 2020, Theorem 5)] The restricted EQL-Lite(UCQ)-minimally complete J-abstraction of a CQ qS always exists and is computable.

We further notice that the above algorithm returns queries that are monotone and that are expressible in DDK, thus proving that, without disjunction, the limited form of negation allowed in restricted EQL-Lite(UCQ) does not give more expressive power in finding minimally complete (and therefore also perfect) abstractions of CQs. On the contrary, it can be shown that inequalities give more expressive power in finding abstractions. In particular, there exist 𝔐-minimally complete J-abstractions of CQs that cannot be expressed in EQL-Lite(UCQ), whereas, as shown in the previous section, they can be expressed in DDK.

Given a query qG as returned by the above algorithm, it is always possible to compute a UCQ qu such that quD=qGJ,D for every S-database D. Thus, following the same line of reasoning as the one at the end of the previous section, in this scenario we can solve the computation problem for perfect abstractions of CQs.

Theorem 13. [Cima et al. (2020)] Checking whether there exists a query q in restricted EQL-Lite(UCQ) that is the perfect J-abstraction of a CQ qS is decidable. Moreover, if it exists, then q is a monotone query and there is an algorithm that computes it.

7.2. One-to-one mapping

We now examine the problem of finding abstractions in the presence of data integration systems J = ⟨G, S, M⟩ such that M is a one-to-one mapping. A one-to-one mapping is a special case of GLAV, constituted by a set of assertions of the form ∃ȳ.s(x̄,ȳ) → ∃z̄.g(x̄,z̄), where s(x̄,ȳ) and g(x̄,z̄) are single atoms without constants or repeated variables.

The first result is that the algorithm previously presented for computing restricted EQL-Lite(UCQ)-minimally complete abstractions of CQs can also be used for computing EQL-Lite(UCQ)-minimally complete abstractions of CQs for data integration systems J with one-to-one mapping.

Theorem 14. [(Cima et al., 2020, Theorem 3)] Under one-to-one mappings, the EQL-Lite(UCQ)-minimally complete J-abstraction of a CQ qS always exists, is computable, and is a monotone query.

Thus, using exactly the same considerations made for the case of restricted EQL-Lite(UCQ), we can solve the computation problem for perfect abstractions in EQL-Lite(UCQ) of CQs under one-to-one mappings.

Theorem 15. [Cima et al. (2020)] Under one-to-one mappings, checking whether there exists a query q in EQL-Lite(UCQ) that is the perfect J-abstraction of a CQ qS is decidable. Moreover, if it exists, then q is a monotone query and there is an algorithm that computes it.

We now turn to the sound case under one-to-one mappings. In this scenario, while the existence of EQL-Lite(UCQ)-maximally sound J-abstractions of CQs is still an open problem, we present an algorithm for computing EQL-Lite(UCQ)-maximally sound J-abstractions of CQJFEs qS. Roughly speaking, given a data integration system $J = \langle G, S, M \rangle$ with M a one-to-one mapping and a CQJFE qS, as a first step the algorithm computes the EQL-Lite(UCQ)-minimally complete J-abstraction qG of qS and its UCQ reformulation qu such that $q_u^D = q_G^{J,D}$ for each S-database D. Then, for each CQ q′ that is a disjunct of qu such that $q' \not\sqsubseteq q_S$ (i.e., q′ is not contained in qS), the algorithm adds in conjunction to the body of qG the negation of the body of the EQL-Lite(UCQ)-minimally complete J-abstraction of q′. Informally, this last step prevents qG from returning answers that are not answers of qS, guaranteeing soundness of the output query. For instance, recall Example 1, and let $J = \langle G, S, M \rangle$ be the data integration system with M = {m1, m2, m3} a one-to-one mapping. The query returned by the algorithm is the EQL-Lite(UCQ) query $\{x \mid K(A(x)) \wedge \neg K(B(x))\}$, which is the EQL-Lite(UCQ)-maximally sound J-abstraction of qS.
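To see how a query of the form {x | K(A(x)) ∧ ¬K(B(x))} behaves on data, the sketch below evaluates it in Python under simplifying assumptions that are ours: all predicates are unary, the global schema has no axioms, and the mapping is a hypothetical list of pairs (source predicate, global predicate) that is not claimed to coincide with m1, m2, m3 of Example 1. Under these assumptions, K(A(x)) holds exactly for the constants stored in some source predicate mapped to A.

```python
from typing import Dict, List, Set, Tuple

Mapping = List[Tuple[str, str]]   # (source predicate, global predicate), all unary
Database = Dict[str, Set[str]]    # source predicate -> set of constants


def certain_atoms(mapping: Mapping, db: Database) -> Dict[str, Set[str]]:
    """Constants c such that G(c) is certain, for each global predicate G.

    With sound assertions s(x) -> G(x) and no axioms, G(c) is certain
    exactly when some source predicate mapped to G contains c.
    """
    cert: Dict[str, Set[str]] = {}
    for source_pred, global_pred in mapping:
        cert.setdefault(global_pred, set()).update(db.get(source_pred, set()))
    return cert


def eval_known_a_not_known_b(mapping: Mapping, db: Database) -> Set[str]:
    """Answers of {x | K(A(x)) and not K(B(x))} over the source database db."""
    cert = certain_atoms(mapping, db)
    return cert.get("A", set()) - cert.get("B", set())


# Hypothetical mapping and source database, for illustration only.
example_mapping = [("s1", "A"), ("s2", "A"), ("s3", "B")]
example_db = {"s1": {"a", "b"}, "s2": {"c"}, "s3": {"b"}}
example_answers = eval_known_a_not_known_b(example_mapping, example_db)
```

In this toy instance the constant b is excluded because B(b) is certain, while a and c are returned.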

Theorem 16. [(Cima et al., 2020, Theorem 4)] Under one-to-one mappings, the EQL-Lite(UCQ)-maximally sound J-abstraction of a CQJFE qS always exists and is computable.
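The algorithm behind Theorem 16 compares each disjunct q′ of qu with qS. Assuming this comparison is a query containment test, it can be carried out with the classical homomorphism criterion for CQs: q1 is contained in q2 exactly when there is a homomorphism from q2 to q1 that maps the head of q2 to the head of q1. The sketch below is a naive backtracking version for CQs without constants; the tuple encoding and the names contained_in and _extend are hypothetical and chosen only for this illustration.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (predicate, variables)
CQ = Tuple[Tuple[str, ...], List[Atom]]  # (head variables, body atoms)


def _extend(h: Dict[str, str], remaining: List[Atom], target: List[Atom]) -> bool:
    """Try to extend the partial homomorphism h to all remaining atoms."""
    if not remaining:
        return True
    pred, args = remaining[0]
    for t_pred, t_args in target:
        if t_pred != pred or len(t_args) != len(args):
            continue
        candidate = dict(h)
        # setdefault binds a fresh variable or returns its existing image
        if all(candidate.setdefault(a, t) == t for a, t in zip(args, t_args)):
            if _extend(candidate, remaining[1:], target):
                return True
    return False


def contained_in(q1: CQ, q2: CQ) -> bool:
    """True iff q1 is contained in q2 (homomorphism from q2 to q1)."""
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    h: Dict[str, str] = {}
    for v2, v1 in zip(head2, head1):
        if h.setdefault(v2, v1) != v1:
            return False
    return _extend(h, list(body2), body1)


# q_a(x) :- s1(x, y), s2(y)      q_b(x) :- s1(x, z)
q_a: CQ = (("x",), [("s1", ("x", "y")), ("s2", ("y",))])
q_b: CQ = (("x",), [("s1", ("x", "z"))])
```

In this example q_a is contained in q_b, since q_a only adds a condition, whereas q_b is not contained in q_a. The search is exponential in the worst case, which is unavoidable in general because CQ containment is NP-complete.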

We conclude this section with the following observation. The algorithms sketched above for computing “best” abstractions always return an EQL-Lite(UCQ) query. This directly implies that, under one-to-one mappings, the query languages EQL-Lite(UCQ) and EQL-Lite(UCQ) have the same expressive power in finding all three kinds of abstractions (perfect, minimally complete, and maximally sound).

8. Open problems

We have provided an overview of data abstraction, and we have illustrated some results obtained in recent years on computing abstractions. We conclude the paper by discussing a set of issues related to abstractions that deserve more investigation.

8.1. Data quality

While data quality is one of the main issues raised in data-centric AI, there is no general and well-established methodology for leveraging data quality to improve Machine Learning methods.

As pointed out in Chen et al. (2021), poor data quality has a direct impact on the performance of the machine learning system that is built on the data. It is therefore important to devise techniques for validating the quality of both training and testing datasets. Recent work in this direction shows a strong correlation between the quality of the datasets and the performance of the machine learning system, and demonstrates that a rigorous evaluation of data quality is necessary for guiding the quality improvement of machine learning. We believe that formal methods like data abstraction can provide some contributions toward this goal. For example, by helping in making the semantics of training data explicit, abstraction can provide support for recognizing biases or other problems in the data used to train a Machine Learning Model. Making concrete steps in this direction is a stimulating research challenge.

8.2. Languages for abstractions

A crucial issue related to abstraction is to compute perfect and approximated abstractions within specific classes of queries. For the fundamental class UCQ, the decidability of checking whether there exists a UCQ-maximally sound abstraction of a UCQ source query is still open. More generally, there are many interesting classes of queries that can be used to express abstractions, and for which it would be interesting to compute perfect, or approximated abstractions. For example, in the case of graph databases as virtual views, relevant classes of queries for abstractions include regular path queries, or two-way conjunctive regular path queries.

8.3. Abstraction and monotonicity

In this paper we have discussed the use of DDK to express monotone abstractions of source queries in the class UCQ. It would be interesting to investigate which is the minimal expressive power needed for capturing perfect and approximated monotone abstractions of source queries. Also, it is not difficult to see that there are queries for which the perfect abstraction is non-monotone. Although first results on non-monotone abstractions have appeared in Cima et al. (2020), the issue of checking the existence of and computing non-monotone abstractions is largely unexplored.
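A query is monotone when enlarging the source database can never remove answers. The following Python sketch illustrates how monotonicity fails for a query of the form {x | K(A(x)) ∧ ¬K(B(x))}, under assumptions that are ours: unary predicates, no axioms on the global schema, and a hypothetical one-to-one mapping given as pairs (source predicate, global predicate).

```python
from typing import Dict, List, Set, Tuple


def answers(mapping: List[Tuple[str, str]], db: Dict[str, Set[str]]) -> Set[str]:
    """Answers of {x | K(A(x)) and not K(B(x))} over a source database.

    Assumes unary predicates and no global axioms, so a global atom is
    certain exactly when a source predicate mapped to it contains the constant.
    """
    cert: Dict[str, Set[str]] = {}
    for source_pred, global_pred in mapping:
        cert.setdefault(global_pred, set()).update(db.get(source_pred, set()))
    return cert.get("A", set()) - cert.get("B", set())


# Hypothetical mapping: s1(x) -> A(x), s2(x) -> B(x).
mapping = [("s1", "A"), ("s2", "B")]

small_db = {"s1": {"a"}}                  # contained in large_db
large_db = {"s1": {"a"}, "s2": {"a"}}     # adds the fact s2(a)

before = answers(mapping, small_db)
after = answers(mapping, large_db)
```

Adding the single fact s2(a) makes B(a) certain, so the answer a obtained over the smaller database is lost over the larger one. A monotone query could not exhibit this behavior.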

8.4. Expressive source queries

The majority of work on abstraction has so far focused on source queries in the class UCQ. It would be interesting to address the problem of computing perfect and approximated abstractions of source queries expressed in more expressive languages such as Datalog. More expressive mapping languages (e.g., UCQ with inequalities in the GLAV type of mapping) also deserve attention.

8.5. Axioms

The computation of abstractions in the presence of axioms in the global schema or in the source schema is another interesting problem to study. First results in this direction appeared in Cima (2017), Lutz et al. (2018), and Cima et al. (2019), but the topic requires a more thorough analysis.

8.6. Reverse engineering

Abstraction also has interesting connections with the reverse-engineering problem (Barceló and Romero, 2017). When cast in data integration, given a source database D and a set P of tuples, this problem aims at finding a global schema query q that captures P, i.e., such that the answers of q with respect to D capture the tuples in P. Despite the intuitive connection, a detailed analysis of the relationship between the two problems is missing.

8.7. User requirements

Finally, crucial aspects of abstractions, such as succinctness and clarity, have not been considered in this paper. More generally, issues related to the adequacy of the formulation of abstractions with respect to user requirements deserve greater attention.

Author contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This work has been partially supported by MUR under the PRIN 2017 project HOPE (prot. 2017MMJJRE), by the EU under the H2020-EU.2.1.1 project TAILOR, grant id. 952215, and by MUR under the PNRR project PE0000013-FAIR.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1. In principle, we could also consider databases that are infinite structures.

2. When we refer to UCQ views, we in fact assume that view definitions are UCQs without repeated variables in the target list. We refer to Afrati and Chirkova (2019) for the complications that can arise when this assumption is removed.

3. This is not the case when view definitions are expressed as regular path queries rather than UCQs (Calvanese et al., 2002).

4. Actually, we consider the slightly restricted version of EQL-Lite(UCQ) which does not allow the use of (in)equalities.

References

Abedjan, Z., Golab, L., and Naumann, F. (2017). “Data profiling: a tutorial,” in Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD 2017) (Chicago, IL), 1747–1751. doi: 10.1145/3035918.3054772

Afrati, F. N., and Chirkova, R. (2019). Answering Queries Using Views. Synthesis Lectures on Data Management, 2nd ed. San Rafael, CA: Morgan and Claypool Publishers. doi: 10.1007/978-3-031-01871-8

Barceló, P., and Romero, M. (2017). “The complexity of reverse engineering problems for conjunctive queries,” in Proceedings of the Twentieth International Conference on Database Theory (ICDT 2017), Volume 68 of Leibniz International Proceedings in Informatics, 7:1–7:17. Available online at: https://www.dagstuhl.de/en/publications/lipics (accessed June 15, 2023).

Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., and Rosati, R. (2007a). “EQL-lite: effective first-order query processing in description logics,” in Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007) (Hyderabad), 274–279.

Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2000). “What is view-based query rewriting?” in Proceedings of the Seventh International Workshop on Knowledge Representation meets Databases (KRDB 2000), Volume 29 of CEUR Electronic Workshop Proceedings, 17–27. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).

Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2002). “Lossless regular views,” in Proceedings of the Twenty-First ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002) (Madison, WI: ACM), 58–66. doi: 10.1145/543613.543646

Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2007b). View-based query processing: on the relationship between rewriting, answering and losslessness. Theor. Comput. Sci. 371, 169–182. doi: 10.1016/j.tcs.2006.11.006

Calvanese, D., and Rosati, R. (2003). “Answering recursive queries under keys and foreign keys is undecidable,” in Proceedings of the Tenth International Workshop on Knowledge Representation meets Databases (KRDB 2003), Volume 79 of CEUR Electronic Workshop Proceedings. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).

Chen, H., Chen, J., and Ding, J. (2021). Data evaluation and enhancement for quality improvement of machine learning. IEEE Trans. Reliab. 70, 831–847. doi: 10.1109/TR.2021.3070863

Cima, G. (2017). “Preliminary results on ontology-based open data publishing,” in Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017), Volume 1879 of CEUR Electronic Workshop Proceedings. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).

Cima, G., Console, M., Lenzerini, M., and Poggi, A. (2021). “Abstraction in data integration,” in Proceedings of the Thirty Sixth IEEE Symposium on Logic in Computer Science (LICS 2021) (Rome: IEEE), 1–11. doi: 10.1109/LICS52264.2021.9470716

Cima, G., Console, M., Lenzerini, M., and Poggi, A. (2022). “Monotone abstractions in ontology-based data management,” in Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), 5556–5563. doi: 10.1609/aaai.v36i5.20495

Cima, G., Lenzerini, M., and Poggi, A. (2017). “Semantic technology for open data publishing,” in Proceedings of the Seventh International Conference on Web Intelligence, Mining and Semantics (WIMS 2017) (Amantea), 1. doi: 10.1145/3102254.3102255

Cima, G., Lenzerini, M., and Poggi, A. (2019). “Semantic characterization of data services through ontologies,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019) (Macao), 1647–1653. doi: 10.24963/ijcai.2019/228

Cima, G., Lenzerini, M., and Poggi, A. (2020). “Non-monotonic ontology-based abstractions of data services,” in Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 243–252. doi: 10.24963/kr.2020/25

Duschka, O. M., and Genesereth, M. R. (1997). “Answering recursive queries using views,” in Proceedings of the Sixteenth ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 1997) (New York, NY), 109–116. doi: 10.1145/263661.263674

Duschka, O. M., and Genesereth, M. R. (1998). “Query planning with disjunctive sources,” in Proceedings of the AAAI-98 Workshop on AI and Information Integration (Cambridge, MA: AAAI/The MIT).

Eiter, T., Gottlob, G., and Mannila, H. (1997). Disjunctive datalog. ACM Trans. Database Syst. 22, 364–418. doi: 10.1145/261124.261126

Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. (2005). Data exchange: semantics and query answering. Theor. Comput. Sci. 336, 89–124. doi: 10.1016/j.tcs.2004.10.033

Halevy, A. Y. (2001). Answering queries using views: a survey. Very Large Database J. 10, 270–294. doi: 10.1007/s007780100054

Lenzerini, M. (2002). “Data integration: a theoretical perspective,” in Proceedings of the Twenty-First ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002) (New York, NY: ACM), 233–246. doi: 10.1145/543613.543644

Levy, A. Y., Mendelzon, A. O., Sagiv, Y., and Srivastava, D. (1995). “Answering queries using views,” in Proceedings of the Fourteenth ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 1995) (San Jose, CA: ACM Press), 95–104. doi: 10.1145/212433.220198

Levy, A. Y., Rajaraman, A., and Ordille, J. J. (1996). “Querying heterogeneous information sources using source descriptions,” in Proceedings of the Twenty-Second International Conference on Very Large Data Bases (VLDB 1996) (Mumbai), 251–262.

Levy, A. Y., and Rousset, M.-C. (1998). Combining Horn rules and description logics in CARIN. Artif. Intell. 104, 165–209. doi: 10.1016/S0004-3702(98)00048-4

Lutz, C., Marti, J., and Sabellek, L. (2018). “Query expressibility and verification in ontology-based data access,” in Proceedings of the Sixteenth International Conference on the Principles of Knowledge Representation and Reasoning (KR 2018) (Tempe, AZ), 389–398.

Nash, A., Segoufin, L., and Vianu, V. (2010). Views and queries: determinacy and rewriting. ACM Trans. Database Syst. 35, 1–21. doi: 10.1145/1806907.1806913

Pottinger, R., and Halevy, A. Y. (2001). MiniCon: a scalable algorithm for answering queries using views. Very Large Database J. 10, 182–198. doi: 10.1007/s007780100048

Sagiv, Y., and Yannakakis, M. (1980). Equivalences among relational expressions with the union and difference operators. J. ACM 27, 633–655. doi: 10.1145/322217.322221

Keywords: knowledge representation, abstraction, automated reasoning, data integration, data preparation

Citation: Cima G, Console M, Lenzerini M and Poggi A (2023) A review of data abstraction. Front. Artif. Intell. 6:1085754. doi: 10.3389/frai.2023.1085754

Received: 31 October 2022; Accepted: 30 March 2023;
Published: 23 June 2023.

Edited by:

Giovanni Sileno, University of Amsterdam, Netherlands

Reviewed by:

Federica Mandreoli, University of Modena and Reggio Emilia, Italy
Joao Pita Costa, UNESCO International Research Center on Artificial Intelligence - IRCAI, Slovenia
Pablo Barcelo, Pontifical Catholic University of Chile, Chile

Copyright © 2023 Cima, Console, Lenzerini and Poggi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Antonella Poggi, poggi@diag.uniroma1.it
