- Dipartimento di Ingegneria Informatica, Automatica e Gestionale “A. Ruberti”, Sapienza University of Rome, Rome, Italy
It is well known that Artificial Intelligence (AI), and in particular Machine Learning (ML), is not effective without good data preparation, as also pointed out by the recent wave of data-centric AI. Data preparation is the process of gathering, transforming and cleaning raw data prior to processing and analysis. Since data nowadays often reside in distributed and heterogeneous data sources, the first activity of data preparation requires collecting data from suitable data sources and data services. It is thus essential that providers describe their data services in a way that makes them compliant with the FAIR guiding principles, i.e., makes them automatically Findable, Accessible, Interoperable, and Reusable (FAIR). The notion of data abstraction has been introduced exactly to meet this need. Abstraction is a kind of reverse engineering task that automatically provides a semantic characterization of a data service made available by a provider. The goal of this paper is to review the results obtained so far in data abstraction, by presenting the formal framework for its definition, reporting on the decidability and complexity of the main theoretical problems concerning abstraction, and discussing open issues and interesting directions for future research.
1. Introduction
Despite the increasing centrality of data in AI, the way in which AI deals with data has remained virtually unchanged since the dawn of the discipline. This has to be contrasted with the well-known fact that Artificial Intelligence (AI), and in particular Machine Learning (ML), is not effective without good data preparation, as also pointed out by the recent wave of data-centric AI. The term “data centric” refers to an architecture where data is the primary and permanent asset. So, data preparation precedes the implementation of any given machine learning task, and can potentially support many such tasks relying on the same domain. More specifically, data preparation is the process of gathering, transforming and cleaning raw data prior to processing and analysis. It is therefore regarded as an important step in any data engineering or data science project, including machine learning. It involves tasks such as understanding, collecting and reformatting data, aggregating, integrating, combining and enriching raw source data, and making modifications and corrections in order to meet quality standards.
The first activity of data preparation requires collecting data from suitable data sources and data services, often distributed and heterogeneous. In the era of data as a driving asset for both the private and the public domain, the availability of services providing data, also called data services, is indeed growing incredibly fast. Thus, on the one hand, more and more data services are available; on the other hand, more and more AI tasks and applications rely on data services. This scenario raises two crucial issues for data-centric AI. First, from a consumer point of view, how to find the “right” data, i.e., data that properly respond to an information need? Second, from a provider point of view, how to release FAIR-compliant data services, i.e., services that are automatically Findable, Accessible, Interoperable, and Reusable (FAIR)? An effective answer to the former question is given by exploiting the state-of-the-art technology for answering queries over data integration systems, which stems from more than thirty years of research. As for the second question, an answer is given by the results on a relatively new service of data integration systems, called abstraction. In order to elaborate on both these answers, let us first take a step back to data integration.
Data integration is the problem of providing a unified and reconciled view of the data stored in a set of autonomous and heterogeneous sources. The theoretical works on data integration systems have advocated a three-layer architecture comprising the data sources, which in our setting are the output of the data services, the global schema, which is a unified shared conceptualization of the domain of interest, and the mapping between the sources and the global schema. Formally, a data integration system is a triple $\mathcal{J} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G}$ is the global schema, $\mathcal{S}$ is the source schema and $\mathcal{M}$ is the mapping, i.e., a set of logical assertions describing how the data at the sources relate to the elements of the global schema. Then, intuitively, given a set of data sources D, $\mathcal{J}$ represents all the (possibly incomplete) databases that are instances of $\mathcal{G}$ satisfying $\mathcal{M}$ w.r.t. D.
Once data services have been integrated by means of a data integration system specified through a triple $\mathcal{J} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, in order to find the “right data,” a data service consumer can rely on query answering. Specifically, by unambiguously expressing an information need as a query $q_{\mathcal{G}}$ over the shared vocabulary of $\mathcal{G}$, the consumer can get the answers that “best correspond” to that need without even having to know the relevant data services. In particular, in most approaches, such answers have been identified as certain answers, i.e., answers to $q_{\mathcal{G}}$ that would be returned by every database represented by $\mathcal{J}$ given a set of data sources D. Also, typically, such answers are computed by first reformulating $q_{\mathcal{G}}$ in terms of a query $q_{\mathcal{S}}$ over $\mathcal{S}$ and then by evaluating $q_{\mathcal{S}}$ over D. Conversely, in order to make a data service FAIR-compliant, a provider can rely on abstraction over $\mathcal{J}$. Specifically, given a data service originally expressed as a query $q_{\mathcal{S}}$ over a set of data sources, the provider can get a query $q_{\mathcal{G}}$ over the shared vocabulary of $\mathcal{G}$ that unambiguously describes the data service content, thus making it accessible, interoperable and reusable. Concretely, given a query $q_{\mathcal{S}}$ over the data sources, the provider would get a query $q_{\mathcal{G}}$ over the global schema whose answers “best correspond” to the data service. Obviously, also for abstraction, the meaning of “best correspond” has to be made precise. Ideally, the query $q_{\mathcal{G}}$ is the one whose certain answers are exactly the answers of $q_{\mathcal{S}}$, for every possible source database. Such a query is called the perfect $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$.
We next use an example for informally introducing and illustrating the main notions related to abstraction. In the example, we focus on queries that are conjunctions of atoms, called conjunctive queries (CQ), and unions thereof, called unions of conjunctive queries (UCQ), and we assume that the evaluation of a query expressed over the global schema is based on the certain answer semantics.
Example 1. Let $\mathcal{J} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$ be a data integration system where the elements of the source schema are the predicates (with associated arity) {s1/1, s2/2, s3/1, s4/1, s5/2}, the elements of the global schema are {g1/2, g2/1, g3/2, g4/2, g5/1}, and $\mathcal{M}$ contains the following assertions (where the free variables are implicitly universally quantified):
Consider the query . It is easy to see that, for every database D, the set of certain answers of coincides with the set of answers of w.r.t. D. It follows that the CQ is a perfect -abstraction of .
Consider the query . A natural candidate for the perfect -abstraction of is . Note, however, that the certain answers to include tuples in s1 that may not belong to s2, and therefore is not even a sound -abstraction of (i.e., it does not retrieve only tuples of ). Indeed, it can be shown that no UCQ exists that is a perfect -abstraction of . However, the query asking for those x such that g1(x, y) is known to be true, i.e., holds in every model of , cannot exploit mapping m1, and therefore avoids retrieving tuples from s1. It follows that such query, which is not expressible as a UCQ, is a perfect -abstraction of . Consider the query . Again, the natural candidate for the perfect -abstraction of is clearly . However, because of m2, the certain answers to also include the values in the first component of s2, and this means that is not a sound -abstraction of , although it is a complete one (i.e., it retrieves all tuples of ). Another possible candidate is the query . However, this query captures only the tuples occurring in s1 which also occur in s4. It follows that is a sound -abstraction, although not a complete one. Actually, it can be shown that no perfect -abstraction of exists in the class UCQ, but and are, respectively, the minimally complete and the maximally sound -abstraction of in the class UCQ.
Consider now the query , and assume that we aim at checking whether its perfect -abstraction can be expressed as a UCQ. We immediately observe that {()∣∃x, y.g4(x, y)∧g2(x)} is a sound -abstraction of . Also, we can easily verify that {()∣∃x, y, x1.g4(x, y)∧g3(x, x1)∧g2(x1)} is also sound, and may retrieve tuples that are not retrieved by {()∣∃x, y.g4(x, y)∧g2(x)}. More generally, all queries of the form {()∣∃x, y, x1, …, xn.g4(x, y)∧g3(x, x1)∧…∧g3(xn−1, xn)∧g2(xn)}, for n≥1, are pairwise incomparable sound -abstractions of . Based on this observation, one can show that there exists no maximally sound -abstraction of in the class UCQ. However, the following Datalog query (with goal Ans) is the maximally sound -abstraction of in the whole class of monotone queries:
We point out that, apart from the scenario of data service providers, data abstraction is relevant in several other contexts. We mention three of them here. In the context of ontology-based data management, abstraction can be used to check whether the mapping provides the right coverage for expressing the relevant data services at the global schema level (Lutz et al., 2018). Also, abstractions can provide the semantics of open datasets and open APIs published by organizations, which is a key aspect for unlocking the full potential of open data (Cima et al., 2017). Finally, abstraction can be the basis for a semantic-based approach to source profiling (Abedjan et al., 2017), again one of the tasks of data preparation, in particular for describing the structure and the content of a data source in terms of the business vocabulary.
The goal of this paper is to review the main notions and results about abstraction. We present the formal framework for its definition, and report about the decidability and complexity of the main theoretical problems concerning abstraction, i.e., verification, existence, and computation. The roadmap of the paper is as follows:
• Section 2 introduces some relevant background about databases, queries, and data integration.
• Section 3 illustrates the formal framework for abstraction in data integration by providing some of the key definitions used throughout the paper.
• Section 4 reports results appearing in Cima et al. (2021) on the relationship between abstraction and another well-studied problem, namely view-based query processing (see, e.g., Halevy, 2001). The latter is the problem of answering a query over a schema in terms of a set of materialized views over the same schema. Interestingly, the established relationship between abstraction and view-based query processing brings to light new results about both problems.
• Section 5 illustrates results related to the problem of computing best UCQ abstractions of UCQ source queries (Cima et al., 2019). The main results are that, while minimally complete abstractions are guaranteed to exist, this is not the case for maximally sound abstractions. Motivated by the latter result, a restricted scenario is introduced, in which the existence of maximally sound abstractions is always guaranteed.
• Section 6 surveys results on computing best monotone abstractions of UCQ source queries (Cima et al., 2022). The principal contributions are the definition of a novel monotone query language (in the context of data integration) and the discussion of how such a language is able to express all forms of the best monotone abstractions (perfect, or approximated).
• Section 7 presents results on computing abstractions of UCQ source queries in a specific, well-known non-monotone query language (Cima et al., 2020). The main results are that all forms of best abstractions are not guaranteed to exist in such a language, and, in virtue of this result, two interesting restricted scenarios are investigated.
• Finally, Section 8 concludes the paper by discussing possible future research on abstraction.
2. Preliminaries
2.1. Databases and queries
We assume a denumerable set of constant symbols C that is included in every alphabet that we shall consider. A database schema (or simply schema) $\mathcal{S}$ is a logical theory, i.e., a finite set of logical axioms, over an alphabet of predicate symbols and constants from C. An $\mathcal{S}$-database D is simply a model of $\mathcal{S}$, i.e., an interpretation for the alphabet of $\mathcal{S}$ that satisfies all the axioms of $\mathcal{S}$, with the additional requirements that (i) the domain of D is C, (ii) every constant is interpreted as itself, and (iii) the extension of every predicate is finite.¹ In what follows, we will often see an $\mathcal{S}$-database as a finite set of ground facts over $\mathcal{S}$, each corresponding to a tuple in the extension of the associated predicate.
As customary, a database query over a schema of arity n, or simply an n-ary -query, is a function associating to each -database a finite set of tuples of constants of arity n. Often, however, it is more convenient to specify queries using expressions from some formal language to which a semantics, i.e., an actual query function, is associated. In what follows, whenever we talk about a query language , we mean the class of all queries that can be expressed using and its associated semantics.
A fundamental query language for our work is the language of First-Order Logic (FOL) queries. A FOL query q for a schema is a -query defined by an expression of the form , where is a tuple of variables, called the distinguished variables of q, and is a FOL formula over the alphabet of containing all the variables in . The arity of q is the arity of , and we will often use to say that are the free variables of the FOL query q and write simply as . Moreover, we will use the predicate ⊤ to form atoms of any arity; such atoms will always be interpreted as true. Given a -database D and a FOL -query q of arity n, qD is the set of all tuples such that .
A conjunctive query (CQ) q over a schema $\mathcal{S}$ is a FOL query of the form $\{\bar{x} \mid \exists \bar{y}.\varphi(\bar{x}, \bar{y})\}$, where $\bar{y}$ is a tuple of variables, called the existential variables of q, and $\varphi(\bar{x}, \bar{y})$ is a finite conjunction of relational atoms. Given a CQ $q = \{\bar{x} \mid \exists \bar{y}.\varphi(\bar{x}, \bar{y})\}$, we say that an existential variable $y \in \bar{y}$ is a join existential variable of q if it occurs more than once in the atoms of $\varphi$. In what follows, we say that a CQ q is a conjunctive query with join-free existential variables (CQJFE) if there is no join existential variable occurring in q.
Other classes of database queries considered in this paper are defined as customary in terms of both syntax and semantics. An atomic query is a FOL query whose formula consists of a single relational atom. A union of conjunctive queries (UCQ) (resp., union of conjunctive queries with join-free existential variables (UCQJFE)) is a query defined as a finite union of CQs (resp., CQJFEs) having the same arity, called its disjuncts, and its semantics is defined via the associated FOL query. For the definition of Datalog, Disjunctive Datalog, and Disjunctive Datalog with inequalities (denoted by DD≠), we refer the reader to Eiter et al. (1997).
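The two notions just defined, the answers of a CQ over a database and its join existential variables, can be made concrete with a small sketch. The encoding is an assumption made only for illustration: an atom is a pair `(predicate, terms)`, variables are strings starting with `?`, and a database is a set of ground atoms. The function names are hypothetical, and evaluation is done by brute force over the constants occurring in the database.

```python
from collections import Counter
from itertools import product

# Hypothetical encoding: an atom is (predicate, (term, ...)); variables are
# strings starting with "?", every other term is a constant.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def evaluate_cq(head, body, db):
    """Answers of the CQ {head | exists other variables. body} over db."""
    variables = sorted({t for _, args in body for t in args if is_var(t)})
    constants = sorted({c for _, args in db for c in args})
    answers = set()
    # Try every assignment of database constants to the query variables.
    for values in product(constants, repeat=len(variables)):
        sub = dict(zip(variables, values))
        if all((p, tuple(sub.get(t, t) for t in args)) in db for p, args in body):
            answers.add(tuple(sub.get(t, t) for t in head))
    return answers

def join_existential_variables(head, body):
    """Existential variables occurring more than once in the body atoms."""
    counts = Counter(t for _, args in body for t in args
                     if is_var(t) and t not in head)
    return {v for v, n in counts.items() if n > 1}

db = {("g4", ("a", "b")), ("g4", ("c", "d")), ("g2", ("a",))}
query_body = [("g4", ("?x", "?y")), ("g2", ("?x",))]
print(evaluate_cq(("?x",), query_body, db))                 # {('a',)}
print(join_existential_variables(("?x",), query_body))      # set()
```

A CQ is a CQJFE exactly when `join_existential_variables` returns the empty set; for instance, the body `g3(?x, ?y), g2(?y)` with head `?x` has `?y` as a join existential variable.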
2.2. Querying sets of databases
In what follows, we will often need to extend the notion of database queries to sets of databases. A generalized $\mathcal{S}$-query q of arity n is a function associating to each set Σ of $\mathcal{S}$-databases a finite set of n-tuples of constants in C, called the answers of q for Σ and denoted $q^{\Sigma}$. As customary, for two $\mathcal{S}$-queries q1 and q2, we write q1⊑q2 if $q_1^{\Sigma} \subseteq q_2^{\Sigma}$ for each set Σ of $\mathcal{S}$-databases, and we write q1≡q2 if both q1⊑q2 and q2⊑q1.
A common method to define a generalized $\mathcal{S}$-query is to lift the semantics of an $\mathcal{S}$-query to sets of $\mathcal{S}$-databases using the notion of certain answers. Given an $\mathcal{S}$-query q and a set Σ of $\mathcal{S}$-databases, the certain answers of q over Σ are defined as $\bigcap_{D \in \Sigma} q^{D}$. Thus, in what follows, we consider that every generalized $\mathcal{S}$-query q is such that, given a set Σ of $\mathcal{S}$-databases, $q^{\Sigma}$ is the set of the certain answers of q over Σ. This small abuse of notation and the observation that $q^{D} = q^{\{D\}}$ allow us to blur the distinction between queries and generalized queries. Therefore, from now on, unless otherwise specified, we will use the term $\mathcal{S}$-query for generalized $\mathcal{S}$-query.
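As a minimal sketch of lifting a query to sets of databases, the certain answers over a finite, explicitly listed collection are just the intersection of the per-database answers. The function name and the example query are hypothetical; in the data integration setting the set of databases is typically infinite, so this only illustrates the definition and is not a way to compute certain answers in general.

```python
def certain_answers(query, databases):
    """Intersection of query(D) over all D in a non-empty finite collection."""
    answer_sets = [set(query(db)) for db in databases]
    if not answer_sets:
        raise ValueError("expected at least one database")
    return set.intersection(*answer_sets)

# Hypothetical unary query: all tuples in the extension of predicate g2.
def g2_members(db):
    return {args for pred, args in db if pred == "g2"}

d1 = {("g2", ("a",)), ("g2", ("b",))}
d2 = {("g2", ("a",)), ("g1", ("a", "c"))}

print(certain_answers(g2_members, [d1, d2]))   # {('a',)}
# With a singleton set, certain answers coincide with ordinary answers.
print(certain_answers(g2_members, [d1]) == g2_members(d1))   # True
```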
2.3. Data integration
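A data integration system (Lenzerini, 2002) is specified by a triple $\mathcal{J} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, where $\mathcal{G}$, the global schema, is a schema over an alphabet $\mathcal{A}_{\mathcal{G}}$, $\mathcal{S}$, the source schema, is a schema over an alphabet $\mathcal{A}_{\mathcal{S}}$ (disjoint from $\mathcal{A}_{\mathcal{G}}$, except for the set C), and $\mathcal{M}$ is a mapping relating $\mathcal{S}$ to $\mathcal{G}$. Specifically, $\mathcal{M}$ is a finite set of assertions of the form $q_{\mathcal{S}} \rightarrow q_{\mathcal{G}}$, where $q_{\mathcal{S}}$ is an $\mathcal{S}$-query and $q_{\mathcal{G}}$ is a $\mathcal{G}$-query of the same arity as $q_{\mathcal{S}}$.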
The semantics of is defined relative to an -database D, and, intuitively, is the set of all the -databases that satisfy with respect to D. A -database B satisfies with respect to D, denoted by , if it satisfies all the assertions in , i.e., for each .
Formally, the semantics of relative to D, denoted as , is defined as . We say that D is consistent with if . The answers to a -query q w.r.t. a data integration system and an -database D is simply , that we often write simply as . For two -queries q1 and q2, we write if for each -database D; and -equivalence are defined accordingly.
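For reference, the semantics described in this subsection can be written out as follows. The notation is an assumption made for illustration: the symbols $\models_D$, $\mathit{sem}_D$ and $\mathit{cert}$, as well as the arrow in mapping assertions, are not necessarily those used in the paper. Each mapping assertion is read as a containment between the source query evaluated over D and the global query evaluated over B, which is the usual reading for sound mappings.

```latex
\begin{align*}
B \models_D \mathcal{M}
  &\iff q_{\mathcal{S}}^{D} \subseteq q_{\mathcal{G}}^{B}
  \ \text{ for each assertion } q_{\mathcal{S}} \rightarrow q_{\mathcal{G}} \text{ in } \mathcal{M},\\
\mathit{sem}_D(\mathcal{J})
  &= \{\, B \mid B \text{ is a } \mathcal{G}\text{-database and } B \models_D \mathcal{M} \,\},\\
\mathit{cert}(q, \mathcal{J}, D)
  &= \bigcap_{B \in \mathit{sem}_D(\mathcal{J})} q^{B}.
\end{align*}
```

Under this reading, D is consistent with $\mathcal{J}$ exactly when $\mathit{sem}_D(\mathcal{J})$ is non-empty, and the answers to a global schema query are its certain answers over $\mathit{sem}_D(\mathcal{J})$.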
Specific classes of mappings considered in the literature are GAV, LAV, GLAV, PGAV, and SPGAV. We introduce them under the assumption that the queries appearing in mapping assertions are conjunctive queries or restricted forms thereof.
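A GLAV mapping is a set of assertions of the form $q_{\mathcal{S}} \rightarrow q_{\mathcal{G}}$, where both $q_{\mathcal{S}}$ and $q_{\mathcal{G}}$ are conjunctive queries over $\mathcal{S}$ and $\mathcal{G}$ respectively, with distinguished variables $\bar{x}$.
A GAV mapping is a special case of GLAV, constituted by a set of assertions of the form $q_{\mathcal{S}} \rightarrow q_{\mathcal{G}}$, where (i) $q_{\mathcal{S}}$ is a conjunctive query over $\mathcal{S}$ and (ii) $q_{\mathcal{G}}$ is an atomic $\mathcal{G}$-query. A pure GAV mapping (PGAV) is a GAV mapping in which each assertion is such that no repeated variables appear in $q_{\mathcal{G}}$. A PGAV mapping is called SPGAV (PGAV with single assertion per predicate) if it does not contain a pair of assertions with the same predicate symbol in the right-hand side.
A LAV mapping is a special case of GLAV, constituted by a set of assertions $q_{\mathcal{S}} \rightarrow q_{\mathcal{G}}$, where (i) $q_{\mathcal{S}}$ is an atomic $\mathcal{S}$-query and (ii) $q_{\mathcal{G}}$ is a conjunctive query over $\mathcal{G}$ with distinguished variables $\bar{x}$.
In what follows, we implicitly refer to a data integration system $\mathcal{J} = \langle \mathcal{G}, \mathcal{S}, \mathcal{M} \rangle$, and when we denote a query by $q_{\mathcal{G}}$ (resp., $q_{\mathcal{S}}$) we mean that the query is a $\mathcal{G}$-query (resp., $\mathcal{S}$-query).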
2.4. The EQL-Lite(UCQ) language
EQL-Lite(UCQ) is a powerful query language in the context of data integration systems, introduced and studied in Calvanese et al. (2007a). An EQL-Lite(UCQ) -query q is an expression of the form where is an EQL formula built according to the following syntax:
with ϱ being a disjunction of conjunctions of relational atoms over possibly involving existentially quantified variables. The semantics is based on the notion of satisfaction of EQL sentences w.r.t. epistemic interpretations, which are pairs with E being a set of interpretations and . We now inductively define when an epistemic interpretation satisfies an EQL sentence φ, written :
Then, the answers of an EQL-Lite(UCQ) query w.r.t. a data integration system and an -database D are those tuples of constants such that for every .
Example 2. Consider Example 1 and suppose we are interested in asking for all x such that there exists y such that we know (x, y) belongs to g1. This can be expressed in EQL-Lite(UCQ) as follows:
Note that the query is different from the query asking for all x such that we know there exists y such that (x, y) belongs to g1, which is expressed as follows:
Indeed, while it can be verified that the answers to over coincide with the answers to the query , the answers to over coincide with the answers to the query . ⃤
3. Framework
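We proceed to introduce the notion of query abstraction following Cima et al. (2019) for the basic definitions. We say that $q_{\mathcal{G}}$ is a perfect $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ if the certain answers of $q_{\mathcal{G}}$ w.r.t. $\mathcal{J}$ and D coincide with the answers of $q_{\mathcal{S}}$ over D, for each $\mathcal{S}$-database D consistent with $\mathcal{J}$. Clearly, if a perfect $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ exists, then it is unique up to $\mathcal{J}$-equivalence, i.e., if q′ is also a perfect $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$, then q′ and $q_{\mathcal{G}}$ are $\mathcal{J}$-equivalent. Therefore, in the following we will talk about the perfect $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$.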
Example 3. Consider Example 1. It is easy to verify that is the perfect -abstraction of . ⃤
The following theorem presents a preliminary characterization of the existence of perfect -abstractions.
Theorem 1. [(Cima et al., 2021, Theorem 1)] There exists a perfect -abstraction of if and only if for all pairs D, D′ of -databases, implies .
As the condition of being a perfect -abstraction of a source query is a rather strong one, it may well be the case that no such global schema query exists.
Example 4. Consider again Example 1. Using Theorem 1, we can show that there exists no perfect -abstraction for . In fact, for the databases D = {s5(a, b)} and , we have but while . ⃤
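In these cases, it is reasonable to consider weaker notions, such as sound or complete approximations of perfectness. We say that $q_{\mathcal{G}}$ is a complete $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ if the answers of $q_{\mathcal{S}}$ over D are contained in the certain answers of $q_{\mathcal{G}}$ w.r.t. $\mathcal{J}$ and D, for each $\mathcal{S}$-database D consistent with $\mathcal{J}$. Similarly, we say that $q_{\mathcal{G}}$ is a sound $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ if the certain answers of $q_{\mathcal{G}}$ w.r.t. $\mathcal{J}$ and D are contained in the answers of $q_{\mathcal{S}}$ over D, for each $\mathcal{S}$-database D consistent with $\mathcal{J}$. Obviously, one is interested in complete or sound abstractions that approximate $q_{\mathcal{S}}$ at best, at least in the context of a specific class of queries. If $\mathcal{L}$ is a class of queries, we say that a global schema query $q_{\mathcal{G}} \in \mathcal{L}$ is an $\mathcal{L}$-minimally complete $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ if $q_{\mathcal{G}}$ is a complete $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ and there is no global schema query $q' \in \mathcal{L}$ such that q′ is a complete $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ and q′ is strictly contained in $q_{\mathcal{G}}$ w.r.t. $\mathcal{J}$. Similarly, we say that a global schema query $q_{\mathcal{G}} \in \mathcal{L}$ is an $\mathcal{L}$-maximally sound $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ if $q_{\mathcal{G}}$ is a sound $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ and there is no global schema query $q' \in \mathcal{L}$ such that q′ is a sound $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ and $q_{\mathcal{G}}$ is strictly contained in q′ w.r.t. $\mathcal{J}$. Here, strict containment w.r.t. $\mathcal{J}$ means that the certain answers of the first query are contained in those of the second for every $\mathcal{S}$-database D, while the converse containment fails for some D.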
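Soundness and completeness quantify over all source databases, but a single database is enough to refute either property. The sketch below tests a candidate on one database, under stated assumptions: the mapping is GAV, the global schema has no axioms, and both queries are CQs. In that setting the certain answers of a CQ can be obtained by evaluating it over the global facts produced by applying the mapping to D, a standard property that the code relies on. The mapping, the queries and all function names are hypothetical and are not those of Example 1.

```python
from itertools import product

# Hypothetical encoding: an atom is (predicate, (term, ...)); variables are
# strings starting with "?", every other term is a constant.
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def evaluate_cq(head, body, db):
    """Answers of the CQ with the given head terms and body atoms over db."""
    variables = sorted({t for _, args in body for t in args if is_var(t)})
    constants = sorted({c for _, args in db for c in args})
    answers = set()
    for values in product(constants, repeat=len(variables)):
        sub = dict(zip(variables, values))
        if all((p, tuple(sub.get(t, t) for t in args)) in db for p, args in body):
            answers.add(tuple(sub.get(t, t) for t in head))
    return answers

def retrieved_global_db(mapping, source_db):
    """Global facts forced by GAV assertions (body, (predicate, head_terms))."""
    return {(pred, row)
            for body, (pred, head) in mapping
            for row in evaluate_cq(head, body, source_db)}

def check_on(source_db, mapping, q_s, q_g):
    """Compare q_s over source_db with the certain answers of the CQ q_g."""
    certain = evaluate_cq(*q_g, retrieved_global_db(mapping, source_db))
    direct = evaluate_cq(*q_s, source_db)
    return {"sound": certain <= direct, "complete": direct <= certain}

# Hypothetical mapping: s1(x) -> g(x) and s2(x) -> g(x).
mapping = [([("s1", ("?x",))], ("g", ("?x",))),
           ([("s2", ("?x",))], ("g", ("?x",)))]
db = {("s1", ("a",)), ("s2", ("b",))}
q_s = (("?x",), [("s1", ("?x",))])   # source query: all x in s1
q_g = (("?x",), [("g", ("?x",))])    # candidate abstraction: all x in g
print(check_on(db, mapping, q_s, q_g))   # {'sound': False, 'complete': True}
```

On this database the candidate also returns `b`, which comes from `s2`, so it is refuted as a sound abstraction. A result of `True` on one database proves nothing in general, since the definitions range over every source database consistent with the system.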
Example 5. Consider again Example 1. Queries and are, respectively, the UCQ-minimally complete and UCQ-maximally sound -abstraction of . ⃤
Depending on the chosen language $\mathcal{L}$, it may be the case that no $\mathcal{L}$-minimally complete or $\mathcal{L}$-maximally sound $\mathcal{J}$-abstraction exists (see again Example 1 for some concrete cases). Moreover, even if one such abstraction exists, it may not be unique. For some classes of queries, however, one can show that if an $\mathcal{L}$-maximally sound (resp., $\mathcal{L}$-minimally complete) $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ exists, then it is unique up to $\mathcal{J}$-equivalence. This is the case, for example, of the class of UCQs: if a UCQ-maximally sound (resp., UCQ-minimally complete) $\mathcal{J}$-abstraction of $q_{\mathcal{S}}$ exists, then it is unique up to $\mathcal{J}$-equivalence. Thus, in the following, we simply talk about the UCQ-maximally sound and the UCQ-minimally complete $\mathcal{J}$-abstraction of a source query $q_{\mathcal{S}}$. Other classes of queries with this property will be introduced in the subsequent sections.
In the next sections, we will study -abstraction for data integration systems of a specific form, namely where (i) the mapping is of type GLAV or special cases of GLAV, and (ii) if not otherwise stated, the set of axioms of both the global schema and the source schema is empty. Also, we will limit our analysis to abstractions of UCQ source queries.
4. View-based query processing and query abstraction
It is well known that there is a relationship between data integration and view-based query processing, grounded on the idea that the sources of a LAV data integration system can be considered as views defined over the global schema, in particular sound views (Lenzerini, 2002). In this section, we take another approach and establish a relationship between GAV data integration systems and views, based on the idea that the elements of the global schema can be considered as views defined over the source schema.
This section is organized as follows. We first recall the basic notions about view-based query processing. Then, in Section 4.1 we make clear the relationship between GAV data integration systems and views, while in Section 4.2 we establish the connection between abstractions and rewriting queries using views. Finally, in Sections 4.3 and 4.4 we use the above connection to introduce results for abstraction and view-based query processing, respectively. All the results presented in this section appear in Cima et al. (2021).
View-based query processing is a general term denoting several tasks related to the presence of views in databases. A set of views over a schema is constituted by a finite set of view predicate symbols, where each has a specific arity, and an associated view definition , i.e., a query over of the same arity of V. An extension of a view V is simply a set of facts for V, and a -extension is constituted by an extension for each view in . Given a -database D, we denote by the -extension . In what follows, we use the term views to indicate a set of views in which all view definitions are queries expressed in the query language .
Two particular notions have been subject to extensive investigations in the view-based processing literature, namely view-based query rewriting and view-based query answering (Calvanese et al., 2000, 2007b).
In the former notion, originated in Levy et al. (1995), we are given a query over a schema and a set of views over , and the goal is to reformulate into a query , called a -rewriting, in terms of the view predicate symbols of . We obtain different variants of -rewritings depending on the relationship between and we aim at. We call (i) a -rewriting of under exact views, or simply -rewriting of , if for every -database D it holds that , (ii) an exact -rewriting of if for every -database D it holds that . Note that, if we fix a specific query language for expressing -rewritings, we might lose power in expressing -rewritings. In this case, a reasonable goal is to compute -rewritings expressible in that are “maximal” in the class . Formally, we say that a query is an -maximal -rewriting of , if (i) is a -rewriting of ; and (ii) there is no such that (a) q1 is a -rewriting of , (b) for each -database D, and (c) there is a -database D for which .
As argued in Nash et al. (2010), given and , the problem of checking whether there exists an exact -rewriting of (called losslessness with respect to rewriting Calvanese et al., 2007b) is equivalent to the problem, called view determinacy (Nash et al., 2010), of checking whether is determined by , denoted , i.e., whether implies for each pair of -databases D1 and D2. Indeed, on the one hand, if , then the function associating to each the tuples , for each -database D, is an exact -rewriting of , on the other hand, if , then such is not a function, and hence an exact -rewriting of cannot exist.
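The determinacy condition can be explored by brute force on small inputs: enumerate all databases built from a finite pool of candidate facts and look for two databases on which the views agree but the query does not. The sketch below does exactly that. Views and queries are modeled as Python functions from a database (a set of facts) to a set of tuples, and all names are hypothetical. Finding a pair refutes determinacy; finding none over a finite pool does not establish it, in line with the undecidability results discussed later in this section.

```python
from itertools import chain, combinations

def determinacy_counterexample(views, query, candidate_facts):
    """Return two databases with equal view extensions but different query
    answers, searching all subsets of candidate_facts; None if none is found."""
    facts = list(candidate_facts)
    subsets = chain.from_iterable(
        combinations(facts, size) for size in range(len(facts) + 1))
    seen = {}  # view extension -> (first database seen, its query answers)
    for subset in subsets:
        db = frozenset(subset)
        extension = tuple(frozenset(view(db)) for view in views)
        answers = frozenset(query(db))
        if extension in seen and seen[extension][1] != answers:
            return seen[extension][0], db
        seen.setdefault(extension, (db, answers))
    return None

# Hypothetical query and views over a unary predicate r.
def r_tuples(db):
    return {args for pred, args in db if pred == "r"}

def r_nonempty(db):
    # Boolean view: the empty tuple is returned iff r has at least one tuple.
    return {()} if r_tuples(db) else set()

pool = [("r", ("a",)), ("r", ("b",))]
print(determinacy_counterexample([r_nonempty], r_tuples, pool))
print(determinacy_counterexample([r_tuples], r_tuples, pool))   # None
```

The first call finds the databases `{r(a)}` and `{r(b)}`: the Boolean view cannot tell them apart, while the query can, so no exact rewriting of the query in terms of that view exists.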
In the view-based query answering, originated in Duschka and Genesereth (1997), besides and we are also given a -extension , and the goal is to compute the so-called certain answers of w.r.t. and , denoted by , which are those tuples of constants such that for each -database D satisfying . We denote by the query over that, for every -extension , computes the certain answers of w.r.t. and , and we call the perfect -rewriting of under sound views, or simply perfect -rewriting of .
4.1. View-based query processing and data integration
We start by describing how to obtain, from any data integration system with PGAV mapping, a suitable set of UCQ views2 , and, vice versa, from any set of UCQ views , a suitable data integration system with PGAV mapping.
For a data integration system with , the set of UCQ views is such that (i) the set of view symbols coincides with , and (ii) for each view symbol g, the associated view definition is the following UCQ over :
where we have one disjunct for each mapping assertion in of the form . Note that, if , then all view definitions in are CQs.
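The construction above can be sketched in a few lines: group the mapping assertions by the global predicate in their right-hand side and take, for each predicate, the union of the corresponding left-hand sides. The encoding is an assumption for illustration only. An assertion is a pair `(body, (predicate, head_variables))`, variables are strings starting with `?`, and heads are assumed to consist of pairwise distinct variables, as in a PGAV mapping. Variables are renamed so that all disjuncts of a view share the same distinguished variables; the names `?x1, ?x2, ...` and `?e1, ?e2, ...` are arbitrary choices.

```python
from collections import defaultdict

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def views_from_pgav(mapping):
    """Map each global predicate to its view definition, given as a list of
    CQ disjuncts (distinguished_variables, body) over the source schema."""
    views = defaultdict(list)
    for body, (pred, head) in mapping:
        # Heads have no repeated variables, so this renaming is one-to-one.
        rename = {v: f"?x{i + 1}" for i, v in enumerate(head)}
        for _, args in body:
            for term in args:
                if is_var(term) and term not in rename:
                    rename[term] = f"?e{len(rename) - len(head) + 1}"
        new_body = [(p, tuple(rename.get(t, t) for t in args)) for p, args in body]
        views[pred].append((tuple(rename[v] for v in head), new_body))
    return dict(views)

def is_cq_views(views):
    """True when every view has a single disjunct, as for SPGAV mappings."""
    return all(len(disjuncts) == 1 for disjuncts in views.values())

# Hypothetical PGAV mapping with two assertions for g1 and one for g2.
mapping = [
    ([("s1", ("?a", "?b"))], ("g1", ("?a", "?b"))),
    ([("s2", ("?u", "?w")), ("s3", ("?w", "?v"))], ("g1", ("?u", "?v"))),
    ([("s4", ("?x",))], ("g2", ("?x",))),
]
views = views_from_pgav(mapping)
for predicate, disjuncts in views.items():
    print(predicate, disjuncts)
print(is_cq_views(views))   # False, since g1 has two disjuncts
```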
Example 6. Let be a data integration system such that with:
Then, the UCQ views over is , where and . ⃤
For a set of UCQ views over a schema , the data integration system is such that (i) coincides with the view predicate symbols in , (ii) has no axiom, and (iii) is defined as follows: for each view symbol and for each CQ that is a disjunct in the UCQ , the mapping includes a mapping assertion of the form: Note that, in general, . However, if is a set of CQ views, then .
Example 7. Let be a set of UCQ views over such that: and .
Then, the data integration system is , where and with:
⃤
For a data integration system with PGAV mapping and a set of UCQ views , the pair is said to be coherent if (i) the schema over which the set of views is defined and the source of coincide, and (ii) or . In what follows, when we talk about a coherent pair , we use to denote the common schema between and .
Based on the relationship between and , the following proposition provides a connection between existence of perfect abstractions and existence of exact rewritings.
Proposition 1. [(Cima et al., 2021, Proposition 1)] If is a coherent pair and is an -query, then there exists a perfect -abstraction of if and only if there exists an exact -rewriting of .
4.2. Abstractions and rewritings of DD≠
We now turn our attention to a concrete class of queries, namely DD≠. From now on, when we use , we refer to a sublanguage of DD≠. By exploiting well-known results, we provide connections between the notion of -abstractions and -rewritings in the context of DD≠ and its sublanguages. To this end, we first introduce some terminology.
Given a mapping relating to and a -query q in a certain query language , the -unfolding of q (Lenzerini, 2002), denoted by , is the -query obtained by replacing each atom α occurring in the expression corresponding to q by the logical disjunction of all the left-hand sides of the mapping assertions in having the predicate symbol of α in the right-hand side (being careful to use unique variables in place of those variables that appear in the left-hand side of the mapping assertions but not in the right-hand side of those).
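A minimal sketch of unfolding for a single CQ is given below, under the same hypothetical encoding used for illustration elsewhere: an assertion is `(body, (predicate, head_variables))`, variables start with `?`, and heads contain pairwise distinct variables (PGAV). Each atom is replaced by the left-hand side of one assertion for its predicate, and distributing the disjunction over the conjunction yields one source CQ per combination of choices. Fresh names of the form `?f1, ?f2, ...` play the role of the unique variables mentioned above; the sketch assumes the input query does not already use such names.

```python
from itertools import count, product

def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def unfold_cq(head, body, mapping):
    """Unfold a CQ over the global schema into a list of source CQs."""
    fresh = count(1)
    choices = []
    for pred, args in body:
        options = []
        for m_body, (m_pred, m_head) in mapping:
            if m_pred != pred:
                continue
            # Bind the assertion's head variables to the atom's terms.
            sub = dict(zip(m_head, args))
            # Variables occurring only in the left-hand side get fresh names.
            for _, m_args in m_body:
                for term in m_args:
                    if is_var(term) and term not in sub:
                        sub[term] = f"?f{next(fresh)}"
            options.append([(p, tuple(sub.get(t, t) for t in a)) for p, a in m_body])
        choices.append(options)
    # One disjunct per way of picking an assertion for every atom.
    return [(head, [atom for part in combo for atom in part])
            for combo in product(*choices)]

# Hypothetical mapping and query.
mapping = [
    ([("s1", ("?x", "?y"))], ("g1", ("?x", "?y"))),
    ([("s2", ("?x", "?z")), ("s3", ("?z", "?y"))], ("g1", ("?x", "?y"))),
    ([("s4", ("?x",))], ("g2", ("?x",))),
]
query = (("?a",), [("g1", ("?a", "?b")), ("g2", ("?b",))])
for disjunct in unfold_cq(*query, mapping):
    print(disjunct)
```

For a UCQ, the unfolding is the union of the unfoldings of its disjuncts. An atom whose predicate has no assertion in the mapping yields no disjunct at all, which matches an empty disjunction.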
Given a set of UCQ views over and a -query q in a certain query language , the -expansion of q (Levy et al., 1995), denoted by , is the -query obtained by replacing each atom α occurring in the expression corresponding to q by the view definition associated to the view predicate name of α (again, being careful to use unique variables in place of those variables that appear in the bodies of the view but not in the heads of those).
Proposition 2. [(Cima et al., 2021, Proposition 2)] If is a coherent pair, is an -query in , and q is a query in , then q is a sound (resp., perfect) -abstraction of if and only if q is a -rewriting (resp., exact -rewriting) of .
Actually, as shown in Duschka and Genesereth (1998, Lemma 1), if allows for the union operator, then for any pair of UCQ views over and query over , if an -maximal -rewriting of exists, then it is unique up to -equivalence, and, moreover, it coincides with the perfect -rewriting of .3 From Proposition 2 and the above observation, we can derive the following result.
Corollary 1. [(Cima et al., 2021, Corollary 1)] If is a coherent pair and allows for the union operator, then for every pair of queries , we have that q is the -maximally sound -abstraction of if and only if q is the perfect -rewriting of .
By exploiting the relationships established above, we are now ready to investigate how results and techniques from the view-based query processing literature can be directly translated into results and techniques in the context of abstraction, and vice versa.
4.3. From view-based query processing to abstraction
By combining Proposition 1 with a well-known undecidability result about view determinacy, we can derive a negative result about an arguably fundamental problem for the notion of abstraction, namely the existence problem (with no restrictions on the query language to express perfect abstractions) of perfect abstractions, even in very restricted settings.
Theorem 2. [(Cima et al., 2021, Theorem 2)] Given a data integration system with and a CQ -query , checking whether there exists a perfect -abstraction of is undecidable.
By exploiting Corollary 1, we now illustrate how to use off-the-shelf algorithms for rewriting queries in the presence of views as algorithms for computing abstractions. By results of Levy et al. (1995), for CQ views , perfect -rewritings of UCQs can always be expressed as UCQs, and can always be computed [e.g., by means of the bucket algorithm (Levy et al., 1996) or the MiniCon algorithm (Pottinger and Halevy, 2001)]. Thus, Corollary 1 implies that, given a data integration system with and a UCQ -query , we can compute the UCQ-maximally sound -abstraction of as follows: (i) compute , and (ii) compute and return the UCQ corresponding to the perfect -rewriting of .
Corollary 2. [(Cima et al., 2021, Corollary 2)] If is a data integration system with SPGAV mapping and is a UCQ -query, then the UCQ-maximally sound -abstraction of exists and is computable.
Things get more complicated when we consider a data integration system with PGAV mappings, which are clearly more expressive than SPGAV, for which is a set of UCQ views, rather than CQ views. Indeed, for UCQ views , UCQ-maximal -rewritings of CQs are not guaranteed to exist (Duschka and Genesereth, 1998; Afrati and Chirkova, 2019), and thus, in general, perfect -rewritings of CQs are not expressible as UCQs. However, the perfect -rewritings of UCQs (actually, even of Datalog queries) can always be expressed in DD≠, and can always be computed using the technique presented in Duschka and Genesereth (1998). Thus, Corollary 1 implies that, given a data integration system with and a UCQ -query , we can compute the DD≠-maximally sound -abstraction of as follows: (i) compute , and (ii) compute and return the DD≠ query corresponding to the perfect -rewriting of .
Corollary 3. [(Cima et al., 2021, Corollary 3)] If is a data integration system with PGAV mapping and is a UCQ -query, then the DD≠-maximally sound -abstraction of exists and is computable.
4.4. From abstraction to view-based query processing
As already observed, Duschka and Genesereth (1998) and Afrati and Chirkova (2019) show that for a given set of UCQ views, UCQ-maximal -rewritings of CQs may not exist. Combined with an observation made above, this means that perfect -rewritings of CQs are in general not expressible as UCQs. We point out that the CQs used to prove such results contain more than one join existential variable. As a consequence, in the case of UCQ views , those results leave open whether (i) the result holds even for with just one join existential variable, and (ii) perfect -rewritings of UCQJFEs are expressible as UCQs. By combining Corollary 1 with results of Cima et al. (2019) (that we will discuss in Section 5), we can actually answer both questions positively.
Corollary 4. [(Cima et al., 2021, Corollary 4)] For a set of UCQ views, the UCQ-maximal -rewritings of may not exist, even if is a CQ with one join existential variable.
On the other hand, in Section 5, we will show that for data integration systems with PGAV mappings, UCQ-maximally sound -abstractions of UCQJFEs are guaranteed to exist, and we will provide an algorithm to compute them (Theorem 5). Thus, given a set of UCQ views over a schema and a UCQJFE -query , we can compute the perfect -rewriting of as follows: (i) compute , and (ii) compute and return the UCQ-maximally sound -abstraction of . This leads to the following positive result for -rewritings of UCQJFEs.
Corollary 5. [(Cima et al., 2021, Corollary 5)] If is a set of UCQ views and is a UCQJFE -query, then the perfect -rewriting of is computable and can be expressed as a UCQ.
5. UCQ abstractions
In this section we investigate the problem of checking the existence of abstractions in the class UCQ, and of their computation. We first study the case of UCQ-minimally complete -abstractions, then we switch to UCQ-maximally sound -abstractions, and finally we tackle perfect -abstractions in the class UCQ. We observe that all the results presented in this section appear in Cima et al. (2019).
On the positive side, we show that UCQ-minimally complete abstractions always exist, by providing an algorithm to compute them. In a nutshell, given a data integration system and a UCQ , an algorithm to compute the UCQ-minimally complete -abstraction of returns the union of CQs of the form obtained by simply “applying” the mapping to each CQ in , using ⊤ to bind the distinguished variables that are not involved in the application of to . Formally, applying the GLAV mapping to a CQ q means to chase (Fagin et al., 2005) the atoms in q by using the tuple generating dependencies corresponding to the assertions in .
Theorem 3. [(Cima et al., 2019, Theorem 13)] The UCQ-minimally complete -abstraction of always exists and is computable.
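As an illustration of what applying a mapping to a CQ amounts to, the following Python sketch performs a single chase round over the frozen body atoms of a source CQ. It is only a sketch under stated assumptions: there are no schema axioms and no constants, so one round is enough because assertion bodies mention only source predicates and heads only global predicates. The atom encoding, the function names `homomorphisms` and `apply_mapping`, the predicates `s1`, `s2`, `s3`, `g`, `h`, and the two mapping assertions are hypothetical and are not taken from the cited works. The treatment of distinguished variables that the mapping does not reach (bound with ⊤ in the description above) is left out.

```python
from itertools import count

# An atom is (predicate, terms). A mapping assertion is (source_body, global_head);
# head variables that do not occur in the body are existentially quantified.


def homomorphisms(pattern, facts, assignment=None):
    """Yield every assignment of pattern variables that maps all pattern atoms into facts."""
    assignment = dict(assignment or {})
    if not pattern:
        yield assignment
        return
    (pred, terms), rest = pattern[0], pattern[1:]
    for fact_pred, fact_terms in facts:
        if fact_pred != pred or len(fact_terms) != len(terms):
            continue
        extended = dict(assignment)
        # setdefault binds a new variable or returns the existing binding to compare.
        if all(extended.setdefault(t, v) == v for t, v in zip(terms, fact_terms)):
            yield from homomorphisms(rest, facts, extended)


def apply_mapping(cq_body, mapping):
    """One chase round: fire every assertion on the frozen body atoms of the source CQ."""
    fresh = count()
    derived = set()
    for source_body, global_head in mapping:
        for hom in homomorphisms(source_body, cq_body):
            local = dict(hom)
            for pred, terms in global_head:
                for t in terms:
                    if t not in local:
                        # Existential head variable: introduce a fresh variable.
                        local[t] = f"_z{next(fresh)}"
                derived.add((pred, tuple(local[t] for t in terms)))
    return derived


# Hypothetical mapping: s1(u) ∧ s2(u, v) → g(u, v)   and   s3(u) → ∃w. h(u, w)
mapping = [
    ([("s1", ("u",)), ("s2", ("u", "v"))], [("g", ("u", "v"))]),
    ([("s3", ("u",))], [("h", ("u", "w"))]),
]
# Hypothetical source CQ body: s1(x) ∧ s2(x, y) ∧ s3(y)
cq_body = [("s1", ("x",)), ("s2", ("x", "y")), ("s3", ("y",))]

print(sorted(apply_mapping(cq_body, mapping)))
```

The derived atoms form the body of one CQ over the global schema; repeating this for every disjunct of a UCQ and taking the union gives the shape of query described above.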
On the negative side, the following shows that UCQ-maximally sound abstractions may not exist.
Theorem 4. [(Cima et al., 2019, Theorem 16)] The UCQ-maximally sound -abstractions of may not exist if at least one of the following is true:
(a) contains a join existential variable;
(b) contains a LAV mapping assertion;
(c) contains a non-PGAV mapping assertion.
Interestingly, in order to illustrate the case (a) of the above theorem we can refer to a slight modification of the data integration system introduced in Example 1. In particular, let be obtained from by removing from the mapping m1, and consider the query of Example 1. Note that and contains a join existential variable, x. Clearly, removing m1 has no impact on the abstraction of . Thus, as already discussed in Example 1, there exists no UCQ-maximally sound -abstraction of .
Motivated by Theorem 4, we next introduce a specific scenario, which we call restricted, obtained from the general one by limiting the mapping language to PGAV, and to be UCQJFEs. It can be shown that for such a restricted scenario, UCQ-maximally sound abstractions always exist. Intuitively, the latter can be derived by showing that for any UCQJFE and data integration system with , a CQ-maximally sound -abstraction of may comprise at most atoms, where is an integer that depends on the number of atoms occurring in and the number of mapping assertions occurring in . Hence, given a data integration system with PGAV mapping and a UCQJFE , an algorithm to compute the UCQ-maximally sound -abstraction of simply returns the union of all CQs comprising at most atoms that are sound -abstractions of . The crucial observation here is that in order to check whether is a sound -abstraction of , it is sufficient to check whether , which is decidable, since both and are UCQs (Sagiv and Yannakakis, 1980).
Theorem 5. [(Cima et al., 2019, Theorem 21)] In the restricted scenario, the UCQ-maximally sound -abstraction of always exists and is computable.
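The containment test between UCQs on which the soundness check relies can be sketched as follows. The sketch assumes UCQs without constants or inequalities and uses two classical criteria: a CQ q1 is contained in a CQ q2 exactly when there is a homomorphism from q2 to q1 that maps the head of q2 onto the head of q1, and a UCQ is contained in another exactly when each of its disjuncts is contained in some disjunct of the other. The encoding of a CQ as a (head, body) pair and the names `find_homomorphism`, `cq_contained_in`, and `ucq_contained_in` are assumptions made for illustration; computing the two UCQs to be compared is not shown.

```python
def find_homomorphism(pattern, facts, assignment):
    """Return an extension of assignment mapping every pattern atom into facts, or None."""
    if not pattern:
        return assignment
    (pred, terms), rest = pattern[0], pattern[1:]
    for fact_pred, fact_terms in facts:
        if fact_pred != pred or len(fact_terms) != len(terms):
            continue
        extended = dict(assignment)
        if all(extended.setdefault(t, v) == v for t, v in zip(terms, fact_terms)):
            result = find_homomorphism(rest, facts, extended)
            if result is not None:
                return result
    return None


def cq_contained_in(q1, q2):
    """True if CQ q1 is contained in CQ q2 (homomorphism from q2 into the frozen q1)."""
    (head1, body1), (head2, body2) = q1, q2
    if len(head1) != len(head2):
        return False
    start = {}
    for v2, v1 in zip(head2, head1):
        # The head of q2 must be mapped position-wise onto the head of q1.
        if start.setdefault(v2, v1) != v1:
            return False
    return find_homomorphism(body2, body1, start) is not None


def ucq_contained_in(ucq1, ucq2):
    """True if every disjunct of ucq1 is contained in some disjunct of ucq2."""
    return all(any(cq_contained_in(q1, q2) for q2 in ucq2) for q1 in ucq1)


# Hypothetical queries: q1 = {x | g(x, y) ∧ g(y, x)},  q2 = {x | g(x, z)}
q1 = (("x",), [("g", ("x", "y")), ("g", ("y", "x"))])
q2 = (("x",), [("g", ("x", "z"))])

print(ucq_contained_in([q1], [q2]))  # q1 is more restrictive than q2
print(ucq_contained_in([q2], [q1]))
```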
To conclude the section, we provide the last positive result about perfect abstractions in the class UCQ. Namely, we show that checking whether there exists a UCQ that is the perfect -abstraction of is decidable. In particular, given a data integration system with GLAV mapping and a UCQ , an algorithm to decide whether there exists a UCQ that is a perfect -abstraction of proceeds as follows. First, it computes the query that is the UCQ-minimally complete -abstraction of . Then, it checks whether is a sound abstraction of (as discussed above). If the answer is negative, then there exists no UCQ that is a perfect -abstraction of . If the answer is positive, then is actually a UCQ, and is the perfect -abstraction of . Thus the algorithm also solves the computation problem for perfect abstractions in the UCQ language.
Theorem 6. [Cima et al. (2019)] Checking whether there exists a query q in the class UCQ that is the perfect -abstraction of is decidable. Moreover, there is an algorithm that computes q, whenever it exists.
6. Monotone abstractions
The notion of monotonicity defines a very natural class of queries that is popular in the fields of databases and knowledge representation alike. The intuition behind monotone queries is simple: a query q is monotone if, whenever the data we possess increases, the answers for q do not decrease. In the literature, however, this notion has been formalized in two distinct ways. In the context of databases, a -query q is monotone if, for every pair of -databases D, D′ such that D⊆D′, we have qD⊆qD′. Even very simple FOL queries can be shown not to be monotone under this notion. On the other hand, in the context of mathematical logic, the notion of monotonicity comes in a different flavor: a -query q is monotone if, for every pair of sets of interpretations Σ, Σ′ for such that Σ⊆Σ′, we have qΣ⊆qΣ′. We observe here that, under the semantics of certain answers, FOL queries are monotone in this sense.
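To make the database notion concrete, the following small Python sketch evaluates two queries over a database D and a superset D′. The predicates `A` and `B`, the constant `a`, and the function names are hypothetical. The query with negation loses an answer when a fact is added, which witnesses non-monotonicity; for the negation-free query the check on this single pair is merely consistent with monotonicity and does not prove it.

```python
def q_positive(db):
    """{x | A(x)}: a query without negation."""
    return {c for (pred, c) in db if pred == "A"}


def q_negated(db):
    """{x | A(x) ∧ ¬B(x)}: a simple FOL query with negation."""
    return {c for (pred, c) in db if pred == "A" and ("B", c) not in db}


D = {("A", "a")}
D_prime = D | {("B", "a")}  # D ⊆ D′

# Compare the answers over D with the answers over the larger database D′.
print(q_positive(D) <= q_positive(D_prime))
print(q_negated(D) <= q_negated(D_prime))
```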
To define the notion of monotone queries in the context of a data integration system, we use the notion of monotonicity from logic. A -query q is monotone in the context of a data integration system if for every pair D, D′ of -databases, implies . In the following, we use 𝔐J to denote the class of monotone queries in the context of , and when is understood, we simply use 𝔐.
This notion of monotonicity is natural yet broad enough to characterize some of the most popular classes of queries. For example, it is trivial to see that queries evaluated under certain answer semantics are monotone. In the light of this consideration, it is natural to ask whether perfect and approximated abstractions in the class of monotone queries always exist for a given class of source queries and whether they can be computed. Moreover, one can show that, whenever an 𝔐-maximally sound (resp., 𝔐-minimally complete) -abstraction exists, then it is unique. Therefore, from now on, given a source query , we will talk about the 𝔐-maximally sound (resp., the 𝔐-minimally complete) -abstraction of .
In the remainder of this section, we survey recent results on monotone abstractions of UCQs presented in Cima et al. (2022). We introduce a language of monotone queries, called DDK, with attractive computational properties (Section 6.1). For the case of data integration systems with no axioms in both the global schema and in the source schema, we show that minimally complete and maximally sound monotone abstractions for UCQ source queries always exist, and are expressible in DDK. From these results, we also derive the decidability of checking whether a perfect monotone abstraction of a given source query exists (Section 6.2).
6.1. A language for monotone abstractions
Monotone queries form a natural yet expressive class of queries. Unsurprisingly, perfect and approximated monotone abstractions require a suitably expressive query language. We now introduce one such language and discuss some of its most compelling computational characteristics. The language, called DDK, is based on disjunctive Datalog, extended with an epistemic operator. We present it in a form specifically tailored for querying data integration systems.
Assume a data integration system and an alphabet of predicate symbols Int, called intensional predicate symbols, disjoint from the alphabets of and . We now consider the case where the logical theories corresponding to both and may have a nonempty set of axioms.
A DDK query for includes a set of rules, each one of two possible forms:
• the typical form of disjunctive Datalog, i.e.,
where b1, …, bm and i1, …, in are atoms on intensional predicates, and
• a new form specified as follows
where each ψi is a conjunction of atoms over Int, and each ϕi is of the form , with a conjunction of atoms over , and a conjunction of inequalities over variables in only.
An n-ary DDK query q for is a pair where is a finite set of DDK rules, called the definition of q, and Ans is an n-ary intensional predicate in Int, called the answer predicate of q.
Answers for DDK queries are defined based on the notions presented in Calvanese et al. (2007a). An interpretation for q is a pair I = (E, f), where E is a set of interpretations for , and f is an interpretation for Int with domain C. An interpretation I = (E, f) satisfies a DDK rule ρ of q (written I⊧ρ) if the following conditions hold:
• If ρ is a formula of the form (1), then I⊧ρ if f⊧ρ, i.e., f satisfies the implication in (1).
• If ρ is a formula of the form (2), then I⊧ρ if for all tuples of values in C, if I satisfies the epistemic formula , then there is j such that is true in f.
An interpretation I for q is called a model of q if all the rules in the definition of q are satisfied by I. It should be clear that, under this definition of semantics, K represents the “knowledge” operator of the modal logic system S5. In other words, the formula Kα should be read as “α is known (i.e., logically implied) by the system”.
We are ready to define what is the answer of a DDK query with respect to and the -database D. Specifically, is a model of q}.
While a thorough analysis of DDK is outside the scope of the present work, we mention some of its most appealing characteristics. Firstly, we observe that DDK generalizes UCQs. In particular, every UCQ q of m disjuncts is equivalent to a DDK query with one rule of the form (2) where the disjuncts of q are in the scope of K. Secondly, every DDK query q over is monotone in the context of . Intuitively, monotonicity follows from a simple form of stratification where certain answers to UCQs (rules of the form (2)) and recursive computations (rules of the form (1)) never mix. In turn, this simple form of stratification guarantees that answering q over boils down to the following: (i) computing certain answers for the UCQs in the scope of K in the left-hand side of rules of the form (2) in q, and (ii) computing the answers for the remaining rules (form (1)) over the result of the previous step. Monotonicity follows from the monotonicity of certain answers to UCQs, and from the fact that the rules of the form (1) define a monotone query. These considerations indicate a third appealing characteristic of DDK. Specifically, the decidability of answering a DDK query q w.r.t. and D depends exclusively on the decidability of answering UCQs over , as the following proposition shows.
Proposition 3. [(Cima et al., 2022, Proposition 2)] Answering DDK queries w.r.t. and D is decidable if and only if computing the certain answers of UCQs w.r.t. and D is decidable.
These results sharply contrast with similar results obtained for plain (non-disjunctive) Datalog. In particular, the undecidability of the latter can be proved even in the case of global schema axioms expressed in very simple Description Logics of the DL-Lite family (see, e.g., Levy and Rousset, 1998; Calvanese and Rosati, 2003).
6.2. Monotone abstractions via DDK
We now turn our attention to monotone abstractions expressed in DDK. We start by observing that, in terms of computational complexity, DDK perfectly fits the problem of computing approximated abstractions, as the following proposition shows.
Proposition 4. [(Cima et al., 2022, Proposition 3)] There exists a data integration system with PGAV mapping and a UCQ such that answering the 𝔐-maximally sound -abstraction of is coNP-hard in data complexity.
In the remainder of this section, we show that DDK is well-suited to express monotone abstractions, both perfect and approximated. In discussing this issue, we go back to our assumption of dealing with data integration systems with no axioms in both the global and the source schema. So, in what follows, we implicitly deal with a data integration system , where and have no axioms, and a UCQ -query , where , for i = 1, …, n.
6.2.1. 𝔐-maximally sound abstractions
In Cima et al. (2022), it is shown that DDK can always express 𝔐-maximally sound -abstractions of UCQs, by illustrating a technique that, given a query , builds a set of DDK rules whose intensional predicates are the predicates in , and then uses such rules to construct the 𝔐-maximally sound -abstraction of as a DDK query. We do not describe the technique in detail here. Rather, we use an example to give an intuition of the construction.
Example 8. Given the following mapping in :
is the following set of DDK rules:
Intuitively, the rules of specify, for the various facts over that are certain, i.e., that are known to hold, the queries over the sources that generate them. For example, the first rule of specifies that, if a constant is known to satisfy g1(x, x), then this knowledge derives either from the answers to the source query {x|∃y.s1(x)∧s2(x, y)} or from the answers to the source query {x|s1(x)∧s3(x, x)}. As another example, the second rule of specifies that the pairs of distinct constants x, y known to satisfy g1(x, y) derive from the query {x, y|s1(x)∧s3(x, y)}. It can be shown that this is crucial for ensuring that the abstraction of queries involving the join of s1 and s3, which is based on the certain answers of g1, does not include data deriving from source queries whose abstraction is based on the certain answers of the projection of g1. Finally, the third rule of takes care of those constants x known to satisfy g1(x, y), for some, not necessarily known, y. Such constants may derive from each of the source queries above.
Using the notion of , we can immediately obtain the 𝔐-maximally sound -abstraction of , by adding to the set constituted by one rule of the form for each disjunct in .
Proposition 5. [(Cima et al., 2022, Theorem 2)] The DDK query is the 𝔐-maximally sound -abstraction of .
In the light of Proposition 5 and from the existence of an algorithm to compute , we obtain the following.
Theorem 7. [(Cima et al., 2022, Theorem 2)] The 𝔐-maximally sound -abstraction of always exists, is computable, and can be expressed in DDK.
6.2.2. 𝔐-minimally complete abstractions
We show that DDK can always express 𝔐-minimally complete -abstractions of UCQs.
Let us first introduce a useful notion. Given a CQ , Saturate(q) denotes the UCQ with inequalities obtained as follows. For each possible unifier μ on the variables in such that for each , Saturate(q) contains a query obtained from μ(q) by adding an inequality atom (t1≠t2) for each pair of distinct variables t1, t2 occurring in μ(q). For a UCQ Q, we denote by Saturate(Q) the UCQ with inequalities consisting of the union of Saturate(q), for each disjunct q of Q. It is easy to see that Saturate(Q) is equivalent to Q, for every UCQ Q.
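The idea behind Saturate can be sketched in Python by enumerating every way of identifying the variables of a CQ, which corresponds to enumerating the partitions of its variable set, and by adding an inequality for each pair of variables that remain distinct. This is only an illustrative sketch: it assumes a CQ without constants, it does not reproduce the side condition that the definition above places on the unifier μ (so it may enumerate more unifiers than the definition allows), and the encoding and the names `partitions` and `saturate` are hypothetical.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of the list items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Either first forms a block of its own ...
        yield [[first]] + part
        # ... or it joins one of the existing blocks.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def saturate(cq):
    """Return a list of (head, body, inequalities), one CQ with inequalities per partition."""
    head, body = cq
    variables = sorted({t for _, terms in body for t in terms} | set(head))
    result = []
    for blocks in partitions(variables):
        # The unifier sends each variable to a representative of its block.
        mu = {v: min(block) for block in blocks for v in block}
        new_head = tuple(mu[v] for v in head)
        new_body = sorted({(p, tuple(mu[t] for t in terms)) for p, terms in body})
        # All remaining variables are required to be pairwise distinct.
        inequalities = list(combinations(sorted(set(mu.values())), 2))
        result.append((new_head, new_body, inequalities))
    return result


# Hypothetical CQ: {x | s1(x) ∧ s3(x, y)}
q = (("x",), [("s1", ("x",)), ("s3", ("x", "y"))])
for disjunct in saturate(q):
    print(disjunct)
```

On this query the sketch produces two disjuncts: one keeping x and y apart with the inequality x≠y, and one identifying y with x, so that s3(x, y) becomes s3(x, x).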
Consider a disjunct qh in in . Clearly, qh is a CQ with inequalities of the form , where are inequality atoms. Let denote the result of chasing the set of relational atoms occurring in qh with . Let ρqh denote the DDK rule . Finally, let qc denote the DDK query consisting of all the rules ρqh for the various qh in and with answer predicate Ans. We can now prove the following.
Proposition 6. [(Cima et al., 2022, Theorem 1)] qc is the 𝔐-minimally complete -abstraction of .
The following statement is a straightforward consequence of Proposition 6.
Theorem 8. [(Cima et al., 2022, Theorem 1)] The 𝔐-minimally complete -abstraction of always exists, is computable, and can be expressed in DDK.
6.2.3. Perfect monotone abstractions
From the results presented above, we can derive an algorithm for checking whether there exists a query in 𝔐 that is the perfect -abstraction of . In particular, observe that if the perfect -abstraction of can be expressed as a query in 𝔐, then it is -equivalent to the 𝔐-minimally complete -abstraction of . Then, from Proposition 6 we know that, in order to check whether there exists a query in 𝔐 that is the perfect -abstraction of , we have to check whether is equivalent to qc modulo .
To this end, we observe the following. There exists a UCQ with inequalities -query qmin such that , for every -database D. Moreover, qmin is computable. These two properties result from being a GLAV data integration system with no source and global schema axioms, and from the specific form of qc. Therefore, in order to check whether there exists a query in 𝔐 that is the perfect -abstraction of , we just need to check whether . The next claim follows from these considerations.
Theorem 9. [(Cima et al., 2022, Theorem 3)] Checking whether there exists a query q in the class 𝔐 that is the perfect -abstraction of is decidable. Moreover, there is an algorithm that computes q, whenever it exists.
7. Non-monotone abstractions
So far, we have limited our analysis of the abstraction reasoning task by focusing on monotone query languages in the context of data integration systems. There exist, however, very simple scenarios in which the perfect abstraction can only be expressed by means of a non-monotone query.
Example 9. Let be such that the global schema has the predicates {A/1, B/1, C/1}, the source schema has the predicates {s1/1, s2/1}, and , where:
Consider the query . One can verify that the perfect -abstraction of is the non-monotone query such that, given an -database D, returns those x for which either (A(x)∧¬B(x)) or C(x) is known to be true, i.e. holds in every -database B such that .
Motivated by the above example, in this section we summarize the most salient aspects of the results in Cima et al. (2020), which investigates the problem of finding perfect (resp. minimally complete, maximally sound) abstractions expressed in the query language EQL-Lite(UCQ).4 For instance, refer to Example 9. The perfect -abstraction of written there in natural language can be formulated through the EQL-Lite(UCQ) query . As in the case of the UCQ and the 𝔐 classes, it can be shown that if an EQL-Lite(UCQ)-maximally sound (resp., EQL-Lite(UCQ)-minimally complete) -abstraction of exists, then it is unique up to -equivalence. Thus, in what follows, we will simply talk about the EQL-Lite(UCQ)-maximally sound (resp., EQL-Lite(UCQ)-minimally complete) -abstraction of .
A natural question that arises is whether “best” abstractions in the EQL-Lite(UCQ) query language always exist. Unfortunately, the following theorem shows that this is not the case for either EQL-Lite(UCQ)-minimally complete abstractions or EQL-Lite(UCQ)-maximally sound abstractions.
Theorem 10. [(Cima et al., 2020, Theorems 1 and 2)] Both the EQL-Lite(UCQ)-minimally complete -abstractions of and the EQL-Lite(UCQ)-maximally sound -abstractions of may not exist.
Due to the above negative result, which holds already for CQJFE queries and data integration systems with PGAV mappings, we now explore two alternative restricted scenarios. The former weakens the target query language for expressing abstractions by considering a fragment of EQL-Lite(UCQ), whereas the latter weakens the mapping language by considering a special case of GLAV. In both restricted scenarios, we assume that source queries are CQs rather than UCQs.
7.1. A restricted non-monotone query language
We now consider the problem of finding abstractions expressed in EQL-Lite−(UCQ), which corresponds to the fragment of EQL-Lite(UCQ) where both nested negation and union operators are disallowed. More formally, an EQL-Lite−(UCQ) query q is an expression of the form where is an EQL formula built according to the following syntax:
with ϱ being a disjunction of conjunctions of atoms over possibly involving existentially quantified variables. For instance, the EQL-Lite(UCQ) query illustrated above, which corresponds to the perfect -abstraction of in Example 9, is not an EQL-Lite−(UCQ) query.
On the negative side, even in this scenario, maximally sound abstractions are not guaranteed to exist, and this holds already for CQJFE queries and data integration systems with PGAV mappings.
Theorem 11. [(Cima et al., 2020, Theorem 2)] The EQL-Lite−(UCQ)-maximally sound -abstractions of may not exist.
On the positive side, we now provide an algorithm for computing EQL-Lite−(UCQ)-minimally complete -abstractions of CQs . The algorithm is similar to the one for the UCQ case (cf. Section 5), except that all the atoms obtained when applying the mapping to the given CQ occur inside the scope of the epistemic operator K, binding also the existential variables coming from the input query. More precisely, given a data integration system and a CQ , the algorithm returns the EQL-Lite−(UCQ) query , where are the existential variables of occurring in , while are the fresh existential variables introduced when applying to . To see the difference with the UCQ case, recall Example 1 in the introduction and the CQ therein. While is the UCQ-minimally complete -abstraction of , the EQL-Lite−(UCQ) query {x∣∃y.K(g1(x, y))} returned by the above algorithm is a better complete approximation than , and is in fact the perfect -abstraction of .
Theorem 12. [(Cima et al., 2020, Theorem 5)] The EQL-Lite−(UCQ)-minimally complete -abstraction of a CQ always exists and is computable.
We further notice that the above algorithm returns queries that are monotone and that are expressible in DDK, thus proving that, without disjunction, the limited form of negation allowed in EQL-Lite−(UCQ) does not give more expressive power in finding minimally complete (and therefore also perfect) abstractions of CQs. On the contrary, it can be shown that inequalities give more expressive power in finding abstractions. In particular, there exist 𝔐-minimally complete -abstractions of CQs that cannot be expressed in EQL-Lite−(UCQ), whereas, as shown in the previous section, they can be expressed in DDK.
Given a query as returned by the above algorithm, it is always possible to compute a UCQ qu such that for every -database D. Thus, following the same line of reasoning as the one at the end of the previous section, in this scenario we can solve the computation problem for perfect abstractions of CQs.
Theorem 13. [Cima et al. (2020)] Checking whether there exists a query q in EQL-Lite−(UCQ) that is the perfect -abstraction of a CQ is decidable. Moreover, if it exists, then q is a monotone query and there is an algorithm that computes it.
7.2. One-to-one mapping
We now examine the problem of finding abstractions in the presence of data integration systems such that is a one-to-one mapping. A one-to-one mapping is a special case of GLAV, constituted by a set of assertions of the form , where and are single atoms without constants or repeated variables.
The first result is that the algorithm previously presented for computing EQL-Lite−(UCQ)-minimally complete abstractions of CQs can also be used for computing EQL-Lite(UCQ)-minimally complete abstractions of CQs for data integration systems with one-to-one mappings.
Theorem 14. [(Cima et al., 2020, Theorem 3)] Under one-to-one mappings, the EQL-Lite(UCQ)-minimally complete -abstraction of a CQ always exists, is computable, and is a monotone query.
Thus, using exactly the same considerations done for the case of EQL-Lite−(UCQ), we can solve the computation problem for perfect abstractions in EQL-Lite(UCQ) of CQs under one-to-one mappings.
Theorem 15. [Cima et al. (2020)] Under one-to-one mappings, checking whether there exists a query q in EQL-Lite(UCQ) that is the perfect -abstraction of a CQ is decidable. Moreover, if it exists, then q is a monotone query and there is an algorithm that computes it.
We now turn to the sound case under one-to-one mappings. Specifically, in this scenario, while the existence of EQL-Lite(UCQ)-maximally sound -abstractions of CQs is still an open problem, we present an algorithm for computing EQL-Lite(UCQ)-maximally sound -abstractions of CQJFEs . Roughly speaking, given a data integration system with a one-to-one mapping and a CQJFE , as a first step the algorithm computes the EQL-Lite(UCQ)-minimally complete of and its UCQ reformulation qu such that for each -database D. Then, for each CQ q′ which is a disjunct of qu such that , the algorithm adds in conjunction to the body of the negation of the body of the EQL-Lite(UCQ)-minimally complete of q′. Informally, this last step prevents the algorithm from returning answers that are not answers of , guaranteeing soundness of the output query. For instance, recall Example 1, and let be the data integration system with a one-to-one mapping. The query returned by the algorithm is the EQL-Lite−(UCQ) query {x∣K(A(x))∧¬K(B(x))}, which is the EQL-Lite(UCQ)-maximally sound abstraction of .
Theorem 16. [(Cima et al., 2020, Theorem 4)] Under one-to-one mappings, the EQL-Lite(UCQ)-maximally sound -abstraction of a CQJFE always exists and is computable.
We conclude this section with the following observation. The algorithms sketched above for computing “best” abstractions always return an EQL-Lite−(UCQ) query. This directly implies that, under one-to-one mappings, the query languages EQL-Lite(UCQ) and EQL-Lite−(UCQ) have the same expressive power in finding all three kinds of abstractions (perfect, minimally complete, and maximally sound).
8. Open problems
We have provided an overview of data abstraction, and we have illustrated some results obtained in recent years on computing abstractions. We conclude the paper by discussing a set of issues related to abstractions that deserve more investigation.
8.1. Data quality
While data quality is one of the main issues proposed in Data-centric AI, there is no general and well-established methodology for leveraging data quality for improving Machine Learning methods.
As pointed out in Chen et al. (2021), poor data quality has a direct impact on the performance of the machine learning system that is built on the data. It is therefore important to devise techniques for validating the quality of both training and testing datasets. Recent work in this direction shows a strong correlation between the quality of the datasets and the performance of the machine learning system, and demonstrates that a rigorous evaluation of data quality is necessary for guiding the quality improvement of machine learning. We believe that formal methods like data abstraction can provide some contributions toward this goal. For example, by helping in making the semantics of training data explicit, abstraction can provide support for recognizing biases or other problems in the data used to train a Machine Learning Model. Making concrete steps in this direction is a stimulating research challenge.
8.2. Languages for abstractions
A crucial issue related to abstraction is to compute perfect and approximated abstractions within specific classes of queries. For the fundamental class UCQ, the decidability of checking whether there exists a UCQ-maximally sound abstraction of a UCQ source query is still open. More generally, there are many interesting classes of queries that can be used to express abstractions, and for which it would be interesting to compute perfect, or approximated abstractions. For example, in the case of graph databases as virtual views, relevant classes of queries for abstractions include regular path queries, or two-way conjunctive regular path queries.
8.3. Abstraction and monotonicity
In this paper we have discussed the use of DDK to express monotone abstractions of source queries in the class UCQ. It would be interesting to investigate what minimal expressive power is needed for capturing perfect and approximated monotone abstractions of source queries. Also, it is not difficult to see that there are queries for which the perfect abstraction is non-monotone. Although first results on non-monotone abstractions have appeared in Cima et al. (2020), the issue of checking the existence of and computing non-monotone abstractions is largely unexplored.
8.4. Expressive source queries
The majority of work on abstraction has so far focused on source queries in the class UCQ. It would be interesting to address the problem of computing perfect and approximated abstractions of source queries expressed in more expressive languages such as Datalog. More expressive mapping languages (e.g., UCQ with inequalities in the GLAV type of mapping) also deserve attention.
8.5. Axioms
The computation of abstractions in the presence of axioms in the global schema or in the source schema is another interesting problem to study. First results in this direction appeared in Cima (2017), Lutz et al. (2018), and Cima et al. (2019), but the topic requires a more thorough analysis.
8.6. Reverse engineering
Abstraction also has interesting connections with the reverse-engineering problem (Barceló and Romero, 2017). When cast in data integration, given a source database D and a set P of tuples, this problem aims at finding a global schema query q that captures P, i.e., such that the answers of q with respect to D capture the tuples in P. Despite the intuitive connection, a detailed analysis of the relationship between the two problems is missing.
8.7. User requirements
Finally, crucial aspects of abstractions, such as succinctness and clarity, have not been considered in this paper. More generally, issues related to the adequacy of the formulation of abstractions with respect to user requirements deserve greater attention.
Author contributions
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.
Funding
This work has been partially supported by MUR under the PRIN 2017 project HOPE (prot. 2017MMJJRE), by the EU under the H2020-EU.2.1.1 project TAILOR, grant id. 952215, and by MUR under the PNRR project PE0000013-FAIR.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1. In principle, we could also consider databases that are infinite structures.
2. When we refer to UCQ views, we in fact assume that view definitions are UCQs without repeated variables in the target list. We refer to Afrati and Chirkova (2019) for the complications that can arise when this assumption is removed.
3. This is not the case when view definitions are expressed as regular path queries rather than UCQs (Calvanese et al., 2002).
4. Actually, we consider the slightly restricted version of EQL-Lite(UCQ) which does not allow the use of (in)equalities.
References
Abedjan, Z., Golab, L., and Naumann, F. (2017). “Data profiling: a tutorial,” in Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD 2017) (Chicago, IL), 1747–1751. doi: 10.1145/3035918.3054772
Afrati, F. N., and Chirkova, R. (2019). Answering Queries Using Views. Synthesis Lectures on Data Management, 2nd ed. San Rafael, CA: Morgan and Claypool Publishers. doi: 10.1007/978-3-031-01871-8
Barceló, P., and Romero, M. (2017). “The complexity of reverse engineering problems for conjunctive queries,” in Proceedings of the Twentieth International Conference on Database Theory (ICDT 2017), Volume 68 of Leibniz International Proceedings in Informatics, 7:1–7:17. Available online at: https://www.dagstuhl.de/en/publications/lipics (accessed June 15, 2023).
Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., and Rosati, R. (2007a). “EQL-lite: effective first-order query processing in description logics,” in Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007) (Hyderabad), 274–279.
Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2000). “What is view-based query rewriting?” in Proceedings of the Seventh International Workshop on Knowledge Representation meets Databases (KRDB 2000), Volume 29 of CEUR Electronic Workshop Proceedings, 17–27. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).
Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2002). “Lossless regular views,” in Proceedings of the Twenty-First ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002) (Madison, WI: ACM), 58–66. doi: 10.1145/543613.543646
Calvanese, D., De Giacomo, G., Lenzerini, M., and Vardi, M. Y. (2007b). View-based query processing: on the relationship between rewriting, answering and losslessness. Theor. Comput. Sci. 371, 169–182. doi: 10.1016/j.tcs.2006.11.006
Calvanese, D., and Rosati, R. (2003). “Answering recursive queries under keys and foreign keys is undecidable,” in Proceedings of the Tenth International Workshop on Knowledge Representation meets Databases (KRDB 2003), Volume 79 of CEUR Electronic Workshop Proceedings. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).
Chen, H., Chen, J., and Ding, J. (2021). Data evaluation and enhancement for quality improvement of machine learning. IEEE Trans. Reliab. 70, 831–847. doi: 10.1109/TR.2021.3070863
Cima, G. (2017). “Preliminary results on ontology-based open data publishing,” in Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017), Volume 1879 of CEUR Electronic Workshop Proceedings. Available online at: http://ceur-ws.org/ (accessed June 15, 2023).
Cima, G., Console, M., Lenzerini, M., and Poggi, A. (2021). “Abstraction in data integration,” in Proceedings of the Thirty Sixth IEEE Symposium on Logic in Computer Science (LICS 2021) (Rome: IEEE), 1–11. doi: 10.1109/LICS52264.2021.9470716
Cima, G., Console, M., Lenzerini, M., and Poggi, A. (2022). “Monotone abstractions in ontology-based data management,” in Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), 5556–5563. doi: 10.1609/aaai.v36i5.20495
Cima, G., Lenzerini, M., and Poggi, A. (2017). “Semantic technology for open data publishing,” in Proceedings of the Seventh International Conference on Web Intelligence, Mining and Semantics (WIMS 2017) (Amantea), 1. doi: 10.1145/3102254.3102255
Cima, G., Lenzerini, M., and Poggi, A. (2019). “Semantic characterization of data services through ontologies,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019) (Macao), 1647–1653. doi: 10.24963/ijcai.2019/228
Cima, G., Lenzerini, M., and Poggi, A. (2020). “Non-monotonic ontology-based abstractions of data services,” in Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 243–252. doi: 10.24963/kr.2020/25
Duschka, O. M., and Genesereth, M. R. (1997). “Answering recursive queries using views,” in Proceedings of the Sixteenth ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 1997) (New York, NY), 109–116. doi: 10.1145/263661.263674
Duschka, O. M., and Genesereth, M. R. (1998). “Query planning with disjunctive sources,” in Proceedings of the AAAI-98 Workshop on AI and Information Integration (Cambridge, MA: AAAI/The MIT).
Eiter, T., Gottlob, G., and Mannila, H. (1997). Disjunctive datalog. ACM Trans. Database Syst. 22, 364–418. doi: 10.1145/261124.261126
Fagin, R., Kolaitis, P. G., Miller, R. J., and Popa, L. (2005). Data exchange: semantics and query answering. Theor. Comput. Sci. 336, 89–124. doi: 10.1016/j.tcs.2004.10.033
Halevy, A. Y. (2001). Answering queries using views: a survey. Very Large Database J. 10, 270–294. doi: 10.1007/s007780100054
Lenzerini, M. (2002). “Data integration: a theoretical perspective,” in Proceedings of the Twenty-First ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002) (New York, NY: ACM), 233–246. doi: 10.1145/543613.543644
Levy, A. Y., Mendelzon, A. O., Sagiv, Y., and Srivastava, D. (1995). “Answering queries using views,” in Proceedings of the Fourteenth ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 1995) (San Jose, CA: ACM Press), 95–104. doi: 10.1145/212433.220198
Levy, A. Y., Rajaraman, A., and Ordille, J. J. (1996). “Querying heterogeneous information sources using source descriptions,” in Proceedings of the Twenty-Second International Conference on Very Large Data Bases (VLDB 1996) (Mumbai), 251–262.
Levy, A. Y., and Rousset, M.-C. (1998). Combining Horn rules and description logics in CARIN. Artif. Intell. 104, 165–209. doi: 10.1016/S0004-3702(98)00048-4
Lutz, C., Marti, J., and Sabellek, L. (2018). “Query expressibility and verification in ontology-based data access,” in Proceedings of the Sixteenth International Conference on the Principles of Knowledge Representation and Reasoning (KR 2018) (Tempe, AZ), 389–398.
Nash, A., Segoufin, L., and Vianu, V. (2010). Views and queries: determinacy and rewriting. ACM Trans. Database Syst. 35, 1–21. doi: 10.1145/1806907.1806913
Pottinger, R., and Halevy, A. Y. (2001). MiniCon: a scalable algorithm for answering queries using views. Very Large Database J. 10, 182–198. doi: 10.1007/s007780100048
Keywords: knowledge representation, abstraction, automated reasoning, data integration, data preparation
Citation: Cima G, Console M, Lenzerini M and Poggi A (2023) A review of data abstraction. Front. Artif. Intell. 6:1085754. doi: 10.3389/frai.2023.1085754
Received: 31 October 2022; Accepted: 30 March 2023;
Published: 23 June 2023.
Edited by:
Giovanni Sileno, University of Amsterdam, Netherlands
Reviewed by:
Federica Mandreoli, University of Modena and Reggio Emilia, Italy
Joao Pita Costa, UNESCO International Research Center on Artificial Intelligence - IRCAI, Slovenia
Pablo Barcelo, Pontifical Catholic University of Chile, Chile
Copyright © 2023 Cima, Console, Lenzerini and Poggi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Antonella Poggi, poggi@diag.uniroma1.it