BRIEF RESEARCH REPORT article

Front. Commun., 23 February 2022
Sec. Psychology of Language
This article is part of the Research Topic Simple and Simplified Languages.

Automatic Text Simplification for German

Sarah Ebling*, Alessia Battisti, Marek Kostrzewa, Dominik Pfütze, Annette Rios, Andreas Säuberli and Nicolas Spring
  • Department of Computational Linguistics, University of Zurich, Zurich, Switzerland

The article at hand aggregates the work of our group in automatic processing of simplified German. We present four parallel (standard/simplified German) corpora compiled and curated by our group. We report on the creation of a gold standard of sentence alignments from the four sources and on the evaluation of automatic alignment methods against this gold standard. We show that one of the alignment methods performs best on the majority of the data sources. We used two of our corpora as a basis for the first sentence-based neural machine translation (NMT) approach toward automatic simplification of German. In follow-up work, we extended our model to render it capable of explicitly operating on multiple levels of simplified German. We show that using source-side language level labels improves performance with regard to two evaluation metrics commonly applied to measuring the quality of automatic text simplification.

1. Introduction

Simplified language1 is a variety of standard language characterized by reduced lexical and syntactic complexity, the addition of explanations for difficult concepts, and clearly structured layout. Two tasks deal with automatic processing of simplified language: automatic readability assessment and automatic text simplification (Saggion, 2017).

Automatic text simplification was initiated in the late 1990s (Chandrasekar et al., 1996; Carroll et al., 1998) and since then has been approached by means of rule-based and statistical methods. As part of a rule-based approach, the operations carried out typically include replacing complex lexical and syntactic units with simpler ones (Chandrasekar et al., 1996; Siddharthan, 2002; Gasperin et al., 2010; Bott et al., 2012; Drndarević and Saggion, 2012). A statistical approach (Specia, 2010; Zhu et al., 2010) generally conceptualizes the simplification task as one of converting a standard-language into a simplified-language text using machine translation techniques on a sentence level. The success of such approaches is contingent on the availability of high-quality sentence alignments.

Research on automatic text simplification is comparatively widespread for languages such as English (Zhu et al., 2010), Spanish (Saggion et al., 2015), Portuguese (Aluisio and Gasperin, 2010), French (Brouwers et al., 2014), Italian (Barlacchi and Tonelli, 2013), and other languages. For German, only a few contributions exist. Research on simplified German has gained momentum in recent years due to a number of legal and political developments in German-speaking countries, such as the introduction of a set of regulations for accessible information technology (Barrierefreie-Informationstechnik-Verordnung, BITV 2.0) in Germany, the approval of rules for accessible information and communication (Barrierefreie Information und Kommunikation, BIK) in Austria, and the ratification of the United Nations Convention on the Rights of Persons with Disabilities (CRPD) in Germany, Austria, and Switzerland. In addition, two volumes on Easy Language appeared in the “Duden” series (Bredel and Maaß, 2016a,b), further highlighting the relevance of the topic. See Maaß (2020, Chapter 2.3) for a comprehensive overview of the situation in Germany.

The article at hand aggregates the work of our group in automatic processing of simplified German. We present four parallel corpora compiled and curated by our group. We report on the creation of a gold standard of sentence alignments from the four sources and on the evaluation of five alignment methods against this gold standard. We used two of the corpora as a basis for the first sentence-based neural machine translation (NMT) approach toward automatic simplification of German. In follow-up work, we extended our model to render it capable of explicitly operating on multiple levels of simplified German.

More specifically, the contributions of the article at hand are:

• Overview of four parallel (standard/simplified German) corpora; for one of these sources, automatically generated sentence alignments are available for research purposes

• Gold standard of sentence alignments from the four sources

• Evaluation of automatic sentence alignment methods based on the gold standard

• First sentence-based NMT approach toward automatic simplification of German

• First multi-level simplification approach for German

The remainder of this article is structured as follows: Section 2.1 discusses approaches to automatic sentence alignment in the context of text simplification. Section 2.2 discusses parallel standard-/simplified-language corpora available for language pairs other than standard German/simplified German. Section 2.3 presents previous approaches to automatic text simplification. Sections 3 to 5 present our own contributions, consisting of compiling four standard German/simplified German parallel corpora (Section 3), creating a gold standard for automatic sentence alignment (Section 4) against which to measure existing automatic sentence alignment methods, and performing automatic text simplification on a sentence level (Section 5). Section 6 offers a conclusion and an outlook on future research.

2. Previous work

2.1. Automatic Sentence Alignment

Sentence alignment between standard- and simplified-language texts is an instance of monolingual sentence alignment. As such, it is unable to rely on well-established heuristics of bilingual sentence alignment based on, for example, sentence length (Gale and Church, 1991). The relation between source and target sentences in a standard-language/simplified-language document pair can be of the following types:

1:1, i.e., one standard-language sentence corresponding to one simplified-language sentence

n:1 (with n>1), i.e., more than one standard-language sentence reduced to a single simplified-language sentence

1:n (with n>1), i.e., one standard-language sentence split up into multiple simplified-language sentences

n:m (with n>1 and m>1), i.e., more than one standard-language sentence corresponding to more than one simplified-language sentence

1:0, i.e., a standard-language sentence omitted in the simplified-language text

0:1, i.e., a simplified-language sentence inserted compared to the standard-language text

This is visualized in Figure 1. Also shown in this figure is an example of a crossing alignment, i.e., an alignment where the order of information of the standard-language text is not the same as that of the simplified-language text (non-monotonicity).

Figure 1. Schematic depiction of source/target sentence relations in standard-language/simplified-language parallel texts.
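As an illustration, the relation types and the monotonicity notion can be expressed in a few lines of Python. This is a minimal sketch with hypothetical sentence indices, not the representation used by any of the tools discussed below:

    # Illustrative only: alignments represented as pairs of source/target
    # sentence index lists; the indices are hypothetical.
    alignments = [
        ([0], [0]),        # 1:1 one standard sentence, one simplified sentence
        ([1, 2], [1]),     # n:1 two standard sentences reduced to one
        ([3], [2, 3]),     # 1:n one standard sentence split into two
        ([4, 5], [4, 5]),  # n:m many-to-many correspondence
        ([6], []),         # 1:0 standard sentence omitted in the simplified text
        ([], [7]),         # 0:1 simplified sentence inserted
    ]

    def is_monotonic(alignments):
        """True if there are no crossing alignments: sorted by source
        position, the target positions must be non-decreasing."""
        pairs = sorted((min(s), min(t)) for s, t in alignments if s and t)
        targets = [t for _, t in pairs]
        return all(a <= b for a, b in zip(targets, targets[1:]))

    print(is_monotonic(alignments))  # True for the example above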

A number of tools have been developed specifically for sentence alignment in the context of text simplification; among them are MASSAlign (Paetzold et al., 2017), CATS (Customized Alignment for Text Simplification) (Štajner et al., 2018), and LHA (Large-scale Hierarchical Alignment for Data-driven Text Rewriting) (Nikolov and Hahnloser, 2019).

MASSAlign is a hierarchical algorithm that uses a vicinity-driven approach. It employs a heuristic according to which the order of information is consistent on the standard- and simplified-language sides, allowing for reduction of the search space. In a first step, MASSAlign searches for alignments between paragraphs, and in a second, for sentence alignments within the aligned paragraphs. The tool employs a similarity matrix with a bag-of-words TF-IDF model with maximum TF-IDF cosine similarity as a similarity metric. The paragraph alignment uses three levels of vicinity: (1) 1:1, 1:n, and n:1 alignments; (2) single-unit skips (where units can be sentences or paragraphs); and (3) long-distance unit skips. Sentence alignment relies on (1) and (2) only.
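The following sketch illustrates the kind of similarity computation MASSAlign builds on, i.e., bag-of-words TF-IDF vectors compared via cosine similarity. It is an illustration using scikit-learn, not MASSAlign's actual implementation, and the sentences are invented:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    source = ["Die Regierung hat ein neues Gesetz beschlossen.",
              "Es tritt im kommenden Jahr in Kraft."]
    target = ["Die Regierung hat ein neues Gesetz gemacht.",
              "Das Gesetz gilt ab dem nächsten Jahr."]

    # Fit on both sides so source and target share a single vocabulary.
    vectorizer = TfidfVectorizer().fit(source + target)
    sim = cosine_similarity(vectorizer.transform(source),
                            vectorizer.transform(target))

    # Maximum TF-IDF cosine similarity per source sentence as the score.
    best = sim.argmax(axis=1)
    print(best)  # index of the most similar target sentence per source sentence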

Like MASSAlign, CATS is capable of aligning paragraphs and sentences in two steps. The tool offers three similarity strategies: one lexical (character-n-gram-based, CNG) and two semantic. The two semantic strategies, WAVG (Word Average) and CWASA (Continuous Word Alignment-based Similarity Analysis), both require pretrained word embeddings. WAVG averages the word vectors of a paragraph or sentence to obtain the final vector for the respective text unit. CWASA is based on the alignment of continuous words using directed edges. CATS offers two different alignment strategies, MST (Most Similar Text) and MST-LIS (MST with Longest Increasing Sequence), to allow for 1:n alignments.
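As an illustration of the WAVG strategy, the sketch below averages word vectors into a sentence vector and compares two such vectors with cosine similarity. It assumes `embeddings` is a dictionary mapping words to pretrained NumPy vectors; this is a simplified rendering of the idea, not CATS's code:

    import numpy as np

    def wavg(sentence, embeddings, dim=300):
        """Average the available word vectors of a sentence (zeros if none)."""
        vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))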

LHA uses a hierarchical alignment approach with two steps: first, document alignment is performed based on document embeddings and an approximate nearest neighbor search using the Annoy library2. Annoy has a low memory footprint because it uses static files as indexes. Second, sentence embeddings and an inter-sentence similarity matrix are used to extract the K nearest neighbors for each source and target sentence. The tool further uses a variation of MST-LIS from CATS to model sentence splitting and compression.
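The nearest neighbor step can be illustrated with the Annoy library itself; in the sketch below, random vectors stand in for real document or sentence embeddings, and the parameter choices are placeholders:

    import random
    from annoy import AnnoyIndex

    dim = 64
    target_embeddings = [[random.random() for _ in range(dim)]
                         for _ in range(1000)]

    index = AnnoyIndex(dim, "angular")  # angular distance ~ cosine similarity
    for i, vec in enumerate(target_embeddings):
        index.add_item(i, vec)
    index.build(10)            # 10 trees
    index.save("targets.ann")  # static index file: low memory footprint

    query = [random.random() for _ in range(dim)]
    neighbors = index.get_nns_by_vector(query, 5)  # K = 5 nearest neighbors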

Vecalign (Thompson and Koehn, 2019) and alignment based on SBERT (Reimers and Gurevych, 2020) were introduced in the context of bilingual sentence alignment. SBERT modifies the pretrained BERT network (Devlin et al., 2019) by using siamese and triplet network structures to arrive at sentence embeddings that may then be compared using cosine similarity. Vecalign aligns sentences with a dynamic programming approximation, scoring candidate alignments by the cosine similarity of averaged sentence embeddings.
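A minimal sketch of SBERT-style scoring with the sentence-transformers library follows; the model name is a placeholder for any pretrained multilingual SBERT model (not necessarily the one used in the experiments below), and the sentences are invented:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    src = ["Die Behörde veröffentlichte gestern einen ausführlichen Bericht."]
    tgt = ["Die Behörde hat einen Bericht geschrieben."]

    src_emb = model.encode(src, convert_to_tensor=True)
    tgt_emb = model.encode(tgt, convert_to_tensor=True)
    scores = util.cos_sim(src_emb, tgt_emb)  # cosine similarity matrix
    print(scores)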

Table 1 characterizes the five alignment methods MASSAlign, CATS, LHA, SBERT, and Vecalign along the following aspects:

All source sentences aligned: whether the alignment method in its default setup force-aligns every source sentence or bases the decision whether to align a source sentence on a similarity threshold (cutoff)3

Concatenation: whether the alignment method concatenates multiple sentences into one and aligns them as one

Crossing alignments: whether the alignment method allows for abandoning the monotonicity restriction, i.e., supports crossing alignments (cf. Figure 1)

Alignment type: which relations between source and target sentences are ultimately supported by the method.

Table 1. Overview of mono- and bilingual sentence alignment tools and methods.

2.2. Sentence-Aligned Parallel Corpora

Automatic text simplification via (sentence-based) machine translation as outlined in Section 1 requires pairs of standard-language/simplified-language texts aligned at the sentence level, i.e., parallel corpora. A number of parallel corpora have been created to this end. Gasperin et al. (2010) compiled the PorSimples Corpus consisting of Brazilian Portuguese texts (2,116 sentences), each with two levels of simplification (“natural” and “strong”), resulting in around 4,500 aligned sentences. Bott and Saggion (2012) produced the Simplext Corpus consisting of 200 Spanish/simplified Spanish document pairs, amounting to a total of 1,149 (Spanish) and 1,808 (simplified Spanish) sentences (approximately 1,000 aligned sentences).

A large parallel corpus for text simplification is the Parallel Wikipedia Simplification Corpus (PWKP) compiled from parallel articles of the English Wikipedia and the Simple English Wikipedia (Zhu et al., 2010), consisting of about 108,000 sentence pairs. The difference in vocabulary size between the English and the simplified English side of the PWKP Corpus amounts to 18%4. Application of the corpus has been criticized for various reasons (Štajner et al., 2018); the most important among these is the fact that Simple English Wikipedia articles are often not translations of articles from the English Wikipedia. Hwang et al. (2015) provided an updated version of the corpus that includes a total of 280,000 full and partial matches between the two Wikipedia versions.

Another frequently used data collection, available for English and Spanish, is the Newsela Corpus (Xu et al., 2015) consisting of 1,130 news articles, each simplified into four school grade levels by professional editors. The difference in vocabulary size between the English side and the simplest level (Simple-4) is 50.8%.

The above-mentioned PorSimples and Newsela corpora present standard-language texts simplified into multiple levels, thus accounting for a recent consensus in the area of simplified-language research, according to which a single level of simplified language is not sufficient; instead, multiple levels are required to account for the heterogeneous target usership.

2.3. Automatic Text Simplification

Specia (2010) introduced statistical machine translation to the automatic text simplification task, using data from a small parallel corpus (roughly 4,500 parallel sentences) for Portuguese. Coster and Kauchak (2011) used the PWKP Corpus in its original form (cf. Section 2.2) to train an MT system. Xu et al. (2016) performed syntax-based MT on the English/simplified English part of the Newsela Corpus (cf. Section 2.2).

Nisioi et al. (2017) pioneered NMT models for text simplification, performing experiments on both the Wikipedia dataset of Hwang et al. (2015) and the Newsela Corpus for English, with automatic alignments derived from CATS (cf. Section 2.1). The authors used LSTMs as instances of Recurrent Neural Networks (RNNs).

More recent contributions to ATS include explicit edit operation modeling (Dong et al., 2019), graded simplification (Nishihara et al., 2019), multi-task learning (Guo et al., 2018; Dmitrieva and Tiedemann, 2021), weakly supervised (Palmero Aprosio et al., 2019), and unsupervised approaches (Surya et al., 2019; Kumar et al., 2020; Laban et al., 2021). These approaches are largely limited to English (Al-Thanyyan and Azmi, 2021) due to a lack of training data in other languages.

Säuberli et al. (2020) presented the first approach to text simplification for German using (sentence-based) NMT models. As data, they used an early version of the APA Corpus (cf. Section 3.2) amounting to approximately 3,500 sentence pairs.

The most commonly applied automatic evaluation metrics for text simplification are BLEU (Papineni et al., 2002) and SARI (Xu et al., 2016). BLEU, the de-facto standard metric for machine translation evaluation, computes token n-gram overlap between a hypothesis and one or multiple references. A shortcoming of BLEU with respect to automatic text simplification is that it rewards hypotheses that do not differ from the input. By contrast, SARI was designed to punish such output. It does so by explicitly considering the input and rewarding tokens in the hypothesis that do not occur in the input but in one of the references (addition) and tokens in the input that are retained (copying) or removed (deletion) in both the hypothesis and one of the references. More precisely, SARI computes the arithmetic average of n-gram precision and recall of the three rewrite operations addition, copying, and deletion, specifically rewarding simplifications that are dissimilar from the input. The metric was shown to exhibit “reasonable correlation with human evaluation on the text simplification task” (Xu et al., 2016).
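As an illustration, both metrics can be computed with off-the-shelf implementations. The sketch below uses sacrebleu for BLEU and the EASSE package as one commonly used SARI implementation; the sentences are invented toy examples, and the exact evaluation setups of the cited papers may differ:

    from sacrebleu.metrics import BLEU
    from easse.sari import corpus_sari

    orig = ["Die Novelle wurde vom Parlament ratifiziert."]
    hyp = ["Das Parlament hat das neue Gesetz angenommen."]
    refs = [["Das Parlament hat das neue Gesetz beschlossen."]]  # one reference

    bleu = BLEU().corpus_score(hyp, refs)   # n-gram overlap with the references
    sari = corpus_sari(orig_sents=orig, sys_sents=hyp, refs_sents=refs)
    print(bleu.score, sari)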

Table 2 displays BLEU and SARI scores for previous sentence-level simplification approaches for different languages.

Table 2. Automatic evaluation scores for sentence-level ATS approaches (PBMT, phrase-based SMT; SBMT, syntax-based MT).

3. Compiling data for automatic processing of simplified German

This section reports on our contributions in building and curating four parallel corpora for use in automatic text simplification for German.

3.1. Web Corpus

Klaper et al. (2013) created the first parallel corpus for German/simplified German, consisting of 256 texts each (approximately 70,000 tokens) downloaded from the Web. Battisti et al. (2020) extended the corpus so that it contains more parallel data as well as, newly, monolingual-only data (simplified German) and information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions)5. The parallel part of the corpus is useful for automatic text simplification via machine translation (cf. Section 2.3), the monolingual-only part for automatic readability assessment, which is not the focus of this article. In addition, monolingual-only data can also be leveraged as part of machine translation through applying back-translation, a data augmentation technique.

The corpus is compiled from PDFs and webpages collected from Web sources in Germany, Austria, and Switzerland. Information on the underlying guidelines for creating simplified German is not available, as the data was collected automatically. The sources mostly represent websites of governments, specialized institutions, and non-profit organizations. The documents cover a range of topics, such as politics (e.g., instructions for voting), health (e.g., what to do in case of pregnancy), and culture (e.g., introduction to art museums). The corpus contains 6,217 documents, of which 5,461 are monolingual-only and 378 are available in both standard German and simplified German. The 378 parallel documents amount to 17,121 sentences on the standard German and 21,072 sentences on the simplified German side. Compared to their standard German counterparts, the simplified German texts in the parallel data have clearly undergone a process of lexical simplification: the vocabulary is smaller by 51% (33,384 vs. 16,352 types), which is comparable to the rate of reduction reported in Section 2.2 for the Newsela Corpus (50.8%).
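Such a type-based vocabulary reduction can be computed as in the following sketch; the whitespace tokenization is deliberately naive and serves only as an illustration:

    def vocabulary_reduction(standard_sents, simplified_sents):
        """Relative reduction in the number of types (unique tokens)."""
        types_std = {tok for s in standard_sents for tok in s.split()}
        types_simp = {tok for s in simplified_sents for tok in s.split()}
        return 1 - len(types_simp) / len(types_std)

    # With the type counts reported above: 1 - 16352 / 33384 ≈ 0.51, i.e., 51%.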

3.2. APA Corpus

A second corpus built by our group, which is a parallel corpus throughout, consists of news items of the Austria Press Agency (Austria Presse Agentur, APA) with their simplified versions.6 At APA, four to six news items per day covering the topics of politics, economy, culture, and sports are manually simplified into two language levels, B1 and A2, following guidelines by capito, the largest provider of simplification services (translations and translators' training) in Austria, Germany, and Switzerland7. Table 3 shows standard German/simplified German (B1) examples from the corpus (Säuberli et al., 2020). The corpus contains a total of 2,426 distinct documents. This amounts to 60,732 standard-language sentences, 30,328 sentences at level B1, and 30,432 sentences at A2. We generated sentence alignments with LHA (cf. Section 2.1), arriving at 10,268 alignments for B1 and 9,456 for A2. The sentence alignments are made available for research purposes8.

Table 3. Examples from the Austria Press Agency (APA) corpus (Säuberli et al., 2020).

3.3. Wikipedia Corpus

This parallel corpus was created by automatically translating 150,064 articles of the Simple English Wikipedia (cf. Section 2.2) to German using DeepL9 10. The synthetically created “simplified German” articles were then aligned on a document level with their standard German counterparts from the German Wikipedia11 using interlanguage links, resulting in 106,126 parallel documents with 6,933,192 standard German sentences and 1,077,992 “simplified German” sentences.

3.4. Capito Corpus

As a provider of simplification services, capito produces a high number of professional simplifications for a variety of documents and text genres, including but not limited to booklets, information texts, websites, and legal texts, which are manually simplified into one or more levels following the capito guidelines. The simplification levels in this corpus include B1, A2, and A1. We extracted simplified German documents along with their standard German counterparts, amounting to 1,055 document pairs for B1, 1,546 for A2, and 839 for A1. The documents contain a total of 183,216 standard-language sentences, 68,529 sentences at level B1, 168,950 sentences at level A2, and 24,243 sentences at level A1. Aligning the sentences with LHA (cf. Section 2.1) yielded 54,224 sentence pairs for B1, 136,582 for A2, and 10,952 for A1.

Table 4 presents an overview of the four data sources.

Table 4. Overview of the four parallel corpora for standard German/simplified German.

4. Sentence alignment gold standard and evaluation of automatic sentence alignment methods

This section reports on the manual creation of a gold standard for sentence alignment based on a subset of the four corpora introduced in Section 3. We subsequently evaluate the five automatic sentence alignment methods presented in Section 2.1 against this gold standard to allow us to select the most accurately aligned sentences as data to train our translation models in Section 5. For more details on this evaluation, see Spring et al. (2021a).

4.1. Method

To create a gold standard against which to measure the performance of the different automatic sentence alignment methods introduced in Section 2.1, we selected approximately 1,500 simplified-language sentences from each of the four sources described in Section 3: the Web Corpus (where 36 documents amount to approximately 1,500 simplified sentences), APA Corpus (134 documents), Wikipedia Corpus (198 documents), and the capito Corpus (42 documents), as summarized in Table 5. Two annotators independently aligned the simplified sentences to their standard-language counterparts, considering all of the alignment types shown in Section 2.1. In case of n:1 or 1:n alignments, the annotators assigned a list of labels of length n to either the standard- or simplified-language sentence. In case of 1:0 or 0:1 alignments, the annotators assigned a placeholder label to the empty standard- or simplified-language sentence. Inter-annotator agreement (Cohen's Kappa) for all corpora was between 0.730 and 0.924 (cf. Table 6). To create a single version of the gold standard, an arbitrator took the final decision in cases where the two annotators disagreed.
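Inter-annotator agreement of this kind can be computed, for example, with scikit-learn, treating each annotator's alignment decision per simplified sentence as a categorical label; the label sequences below are invented for illustration:

    from sklearn.metrics import cohen_kappa_score

    # One label per simplified sentence: the aligned standard-sentence
    # index (or indices), or "NONE" for 0:1 insertions.
    annotator_1 = ["5", "6+7", "NONE", "8", "9"]
    annotator_2 = ["5", "6",   "NONE", "8", "9"]
    print(cohen_kappa_score(annotator_1, annotator_2))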

Table 5. Overview of the gold standard of sentence alignments for standard German/simplified German.

Table 6. Cohen's Kappa per data source.

4.2. Results

The alignment methods presented in Section 2.1 were used with their default settings and embeddings (where applicable)12 to align sentences in the pairs of standard-language and simplified-language documents that make up the gold standard. Alignment was performed in both directions, simple to complex and vice versa, and the union of the alignments extracted in both directions was used. This made it possible to evaluate alignment methods that extract only n:1 alignments, even though the gold standard contains n:m alignments. Evaluation was performed with the Vecalign scoring script13, which evaluates the diverse alignments that naturally occur in text simplification in a standardized way by converting all alignments to a collection of 1:1 alignments.
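The following sketch mirrors the underlying idea (not the scoring script's actual code): every alignment is expanded into its set of 1:1 index pairs, and hypothesis pairs are scored against gold pairs with F1:

    def to_pairs(alignments):
        """(source indices, target indices) alignments -> set of 1:1 pairs."""
        return {(s, t) for src, tgt in alignments for s in src for t in tgt}

    def f1_score(hypothesis, gold):
        hyp, ref = to_pairs(hypothesis), to_pairs(gold)
        if not hyp or not ref:
            return 0.0
        p = len(hyp & ref) / len(hyp)  # precision
        r = len(hyp & ref) / len(ref)  # recall
        return 2 * p * r / (p + r) if p + r else 0.0

    # e.g., a 2:1 gold alignment only partially recovered as a 1:1 alignment:
    print(f1_score([([1], [1])], [([1, 2], [1])]))  # 0.666...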

The results of evaluating the performance of the five alignment methods (MASSAlign, CATS, LHA, SBERT, Vecalign; with CATS featuring three sub-methods) against the gold standard are shown in Table 7 (Spring et al., 2021a). Lower CEFR levels (available in the capito and APA data) proved harder to align and in general corresponded to lower F1 scores. The alignment task becomes harder with increasing distance from standard German, as simplification requires more modifications to the text. Also, on lower CEFR levels, elaborations and explanations are increasingly common. Generally, the alignment methods performed best on the Web and capito data, with average F1 scores being considerably higher. The low overall scores on the Wikipedia data could be explained by the fact that it is the dataset with the largest disparity between the number of standard German and simplified German sentences (cf. Section 3). Regarding the alignment methods, LHA performed best on five out of the seven datasets. It is also the method with the highest F1 scores on average. On capito A1 and capito A2, Vecalign reached the highest scores.

Table 7. F1 scores of sentence alignment evaluation from Spring et al. (2021a).

5. Sentence-based automatic text simplification

This section reports on our work in training NMT models on two of the data sources introduced in Section 3. For more details, the reader is referred to Spring et al. (2021b).

5.1. Method

For these experiments, we used the APA and the capito corpora introduced in Sections 3.2 and 3.4, respectively, amounting to 19,724 sentence alignments for the APA Corpus (10,268 for B1 and 9,456 for A2) and 201,758 for the capito Corpus (54,224 for B1, 136,582 for A2, and 10,952 for A1), produced with LHA (cf. Section 2.1).

Our baseline models were trained on all available training data across all levels, i.e., these models were language-level-agnostic. They performed generic simplification because they had no explicit method to determine the desired level of simplification on the target side. We trained transformer models (Vaswani et al., 2017) with five layers, four attention heads, 512 hidden units in the transformer layers, and 2,048 hidden units in the transformer feed-forward layers. Embedding dropout and label smoothing were set to 0.3. We used BLEU for early stopping on a held-out development set with a patience of 10 checkpoints. We trained with a shared vocabulary (20,000 BPE operations). All our experiments were carried out in sockeye (Hieber et al., 2018).

Our experimental models made use of source-side labels corresponding to the desired CEFR level of the target sentence. These labels allow the model to distinguish between the different CEFR levels and thus to simplify into different complexity levels. Labels have been used in a variety of tasks, such as domain adaptation (Kobus et al., 2017), multilingual translation (Johnson et al., 2017), and making better use of back-translation (Caswell et al., 2019). Apart from this modification to the training data, the model architecture and all hyperparameters were identical to the baseline models, including the shared vocabulary of 20,000 BPE operations.
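A minimal sketch of this labeling step follows; the tag format is hypothetical, chosen only to illustrate how a source-side token can encode the desired target level:

    def add_level_label(source_sentence, level):
        """Prepend a CEFR-level tag so the model can condition on it."""
        return f"<2{level}> {source_sentence}"

    print(add_level_label("Die Regierung hat ein neues Gesetz beschlossen.", "A2"))
    # -> <2A2> Die Regierung hat ein neues Gesetz beschlossen.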

To evaluate our models, we used a test set that consists of 500 parallel sentences each for A1, A2, and B1, which were randomly sampled from the combined corpus.

5.2. Results

The BLEU and SARI scores of our two models on the test sets are presented in Table 8. The SARI values of our baseline model are comparable to the results of Säuberli et al. (2020) (cf. Table 2), who used a preliminary version of the APA corpus of approximately 3,500 sentence pairs (cf. Section 2.3), but our baseline achieved higher BLEU scores in the range of 13.4 to 16.3. The experimental model reached improved scores for both metrics. The use of source-side labels boosted performance in terms of BLEU on A1 and B1, with the new values in the range of 14.1 to 17.2. The BLEU score did not improve for A2, the level with the highest amount of parallel data available (cf. Section 3). This indicates that the addition of source-side labels may be especially helpful in low-resource settings: A1 and B1, for both of which substantially less data was available, reached higher scores with the experimental model. In terms of SARI, the addition of source-side labels led to considerable improvements for all levels, with the new scores lying in the range of 41.53 to 43.12.

Table 8. BLEU and SARI scores of the different models.

6. Conclusion and Outlook

This article has presented the work of our group in automatic processing of simplified German. We have given an overview of four parallel corpora compiled and curated by our group: the Web, APA, Wikipedia, and capito corpora. Moreover, we have reported on the creation of a gold standard of sentence alignments from the four sources and on the evaluation of five alignment methods against this gold standard (MASSAlign, CATS, LHA, SBERT, Vecalign; with CATS featuring three sub-methods). We found that LHA performed best on five out of the seven datasets (Web, Wikipedia, capito A1, capito A2, capito B1, APA A2, APA B1). It was also the method with the highest average F1 scores (on capito A1 and capito A2, Vecalign reached the highest absolute scores). In general, for the multi-level sources (capito and APA), lower CEFR levels proved harder to align and corresponded to lower F1 scores. Intuitively, the alignment task becomes harder with increasing distance from standard German, as simplification requires more modifications to the text. Also, on lower CEFR levels, elaborations and explanations are increasingly common. Generally, the alignment methods performed best on the Web and capito data, with average F1 scores being considerably higher. The low overall scores on the Wikipedia data can be explained by the fact that it is the dataset with the largest disparity between the number of standard German and simplified German sentences.

We used the LHA alignments as a basis for the first sentence-based NMT approach toward automatic simplification of German (baseline model), and we proposed a model that is capable of explicitly operating on multiple levels of simplified German. We showed that compared to our baseline model, this multi-level experimental model reached improved scores for both automatic evaluation metrics, BLEU and SARI. Specifically, performance improved on all levels with respect to SARI and on A1 and B1 with respect to BLEU (A2 is the level with the highest amount of parallel data available).

We plan to further investigate the potential of the various alignment methods by varying the embedding strategies and the cutoff values used. In doing so, we expect to further increase the performance of our text simplification approaches according to automatic metrics. In addition, we plan to evaluate the output of future models with the help of human experts and to investigate the comprehensibility of the output among the target groups, e.g., persons with cognitive impairments.

Data Availability Statement

The datasets presented in this article are not readily available because data from the commercial provider of simplification services (capito) is not publishable. Sentence alignments based on APA have, however, been made publicly available here: https://zenodo.org/record/5148163.

Author Contributions

SE as the group leader was involved in the creation of the parallel corpora, the sentence alignment gold standard, and the text simplification experiments. AS, NS, and AR carried out the text simplification experiments. AB was the primary person responsible for the Web corpus and one of the annotators of the sentence alignment gold standard. DP was the second annotator. NS was the arbitrator and performed the evaluation of the sentence alignment methods relative to the gold standard. MK provided the sentence alignments. All authors contributed to the article and approved the submitted version.

Funding

This submission received funding through the capito automatisiert project funded by the Austrian Research Promotion Agency (Österreichische Forschungsförderungsgesellschaft, FFG) General Programme under grant agreement number 881202.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Acknowledgments

The authors are greatly indebted to the Austria Presse Agentur (APA) and CFS GmbH for providing data consisting of standard German documents with their simplified counterparts.

Footnotes

1. ^The term “simplified language” is used to denote the sum of all “comprehensibility-enhanced varieties of natural languages” (Maaß, 2020, p. 52), i.e., what is commonly termed “Easy Language” (German leichte Sprache) and “Plain Language” (German einfache Sprache). Maaß (2020, p. 52) mentions “easy-to-understand language” as an umbrella term subsuming these varieties. However, in this contribution, we prefer the term “simplified language” to emphasize the notion of the result of a simplification process.

2. ^https://github.com/spotify/annoy (last accessed: May 5, 2021).

3. ^Note that for CATS, the alignment direction is from simplified language to standard language; hence, CATS searches for one or more standard-language sentences for each simplified-language sentence.

4. ^Vocabulary size as an indicator of lexical richness is generally taken to correlate positively with complexity (Vajjala and Meurers, 2012).

5. ^The importance of the latter type of information has repeatedly been stressed, e.g., for automatic readability assessment (Bredel and Maaß, 2016a; Arfé et al., 2018; Bock, 2018).

6. ^Note that news items are among the most frequent sources of simplification (Caseli et al., 2009; Klerke and Søgaard, 2012; Bott and Saggion, 2014; Goto et al., 2015; Xu et al., 2015).

7. ^https://www.capito.eu/ (last accessed: August 4, 2020). capito distinguishes between three levels along the Common European Framework of Reference for Languages (CEFR; Council of Europe, 2009): A1, A2, and B1. Each level is linguistically operationalized, i.e., specified with respect to linguistic constructions permitted or not permitted at the respective level.

8. ^https://zenodo.org/record/5148163 (last accessed: October 14, 2021).

9. ^https://www.deepl.com/translator (last accessed: May 5, 2021). Simple English Wikipedia authors are instructed to “use Basic English words and shorter sentences”, where Basic English refers to the variety introduced by Ogden (1944) that consists of 850 words on the lexical side.

10. ^The Simple Wikipedia dump of 12/12/2019 was used, https://dumps.wikimedia.org/simplewiki/ (last accessed: April 26, 2021).

11. ^Obtained by using the CirrusSearch dump as of 14/09/20, https://dumps.wikimedia.org/other/cirrussearch/ (last accessed: May 5, 2021).

12. ^One of the tools, CATS, for example, offers an n-gram-based alignment approach that does not employ embeddings of any kind.

13. ^https://github.com/thompsonb/vecalign (last accessed: April 26, 2021).

References

Al-Thanyyan, S. S., and Azmi, A. M. (2021). Automated text simplification: a survey. ACM Comput. Surveys (CSUR) 54, 1–36. doi: 10.1145/3442695

Aluisio, S. M., and Gasperin, C. (2010). “Fostering digital inclusion and accessibility: the PorSimples project for simplification of Portuguese texts,” in Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas (Los Angeles, CA), 46–53.

Arfé, B., Mason, L., and Fajardo, I. (2018). Simplifying informational text structure for struggling readers. Read. Writ. 31, 2191–2210. doi: 10.1007/s11145-017-9785-6

Barlacchi, G., and Tonelli, S. (2013). “ERNESTA: a sentence simplification tool for children's stories in Italian,” in Proceedings of the 14th Conference on Intelligent Text Processing and Computational Linguistics (CICLing) (Samos), 476–487.

Battisti, A., Pfütze, D., Säuberli, A., Kostrzewa, M., and Ebling, S. (2020). “A corpus for automatic readability assessment and text simplification of German,” in Proceedings of the 12th Language Resources and Evaluation Conference (Marseille: European Language Resources Association), 3295–3304.

Bock, B. (2018). “Leichte Sprache”–Kein Regelwerk. Sprachwissenschaftliche Ergebnisse und Praxisempfehlungen aus dem LeiSA-Projekt. Technical Report, Universität Leipzig.

Bott, S., and Saggion, H. (2012). “Automatic simplification of Spanish text for e-Accessibility,” in Proceedings of the 13th International Conference on Computers Helping People with Special Needs (ICCHP) (Linz), 527–534.

Bott, S., and Saggion, H. (2014). Text simplification resources for Spanish. Lang. Resour. Eval. 48, 93–120. doi: 10.1007/s10579-014-9265-4

Bott, S., Saggion, H., and Figueroa, D. (2012). “A hybrid system for Spanish text simplification,” in Proceedings of the Third Workshop on Speech and Language Processing for Assistive Technologies (Montreal, QC), 75–84.

Bredel, U., and Maaß, C. (2016a). Leichte Sprache: Theoretische Grundlagen. Orientierung für die Praxis (Berlin: Duden).

Bredel, U., and Maaß, C. (2016b). Ratgeber Leichte Sprache: Die wichtigsten Regeln und Empfehlungen für die Praxis (Berlin: Duden).

Brouwers, L., Bernhard, D., Ligozat, A.-L., and François, T. (2014). “Syntactic sentence simplification for French,” in Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) (Gothenburg), 47–56.

Carroll, J., Minnen, G., Canning, Y., Devlin, S., and Tait, J. (1998). “Practical simplification of English newspaper text to assist aphasic readers,” in Proceedings of the AAAI'98 Workshop on Integrating AI and Assistive Technology (Madison, WI), 7–10.

Caseli, H. M., Pereira, T. F., Specia, L., Pardo, T. A. S., Gasperin, C., and Aluisio, S. M. (2009). “Building a Brazilian Portuguese parallel corpus of original and simplified texts,” in 10th Conference on Intelligent Text Processing and Computational Linguistics (Mexico City), 59–70.

Caswell, I., Chelba, C., and Grangier, D. (2019). “Tagged back-translation,” in Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers) (Florence: Association for Computational Linguistics), 53–63.

Chandrasekar, R., Doran, C., and Srinivas, B. (1996). “Motivations and methods for text simplification,” in Proceedings of the 16th Conference on Computational Linguistics (Copenhagen), 1041–1044.

Coster, W., and Kauchak, D. (2011). “Learning to simplify sentences using Wikipedia,” in Proceedings of the Workshop on Monolingual Text-To-Text Generation (MTTG) (Portland, OR), 1–9.

Council of Europe (2009). Common European Framework of Reference for Languages: Learning, Teaching, Assessment (Cambridge: Cambridge University Press).

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (Minneapolis, MN: Association for Computational Linguistics), 4171–4186.

Dmitrieva, A., and Tiedemann, J. (2021). “Creating an aligned Russian text simplification dataset from language learner data,” in Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing (Kiyv: Association for Computational Linguistics), 73–79.

Dong, Y., Li, Z., Rezagholizadeh, M., and Cheung, J. C. K. (2019). “EditNTS: an neural programmer-interpreter model for sentence simplification through explicit editing,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence: Association for Computational Linguistics), 3393–3402.

Drndarević, B., and Saggion, H. (2012). “Towards automatic lexical simplification in Spanish: an empirical study,” in Proceedings of the First Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) (Montreal, QC), 8–16.

Gale, W. A., and Church, K. W. (1991). “A program for aligning sentences in bilingual corpora,” in 29th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Berkeley, CA), 177–184.

Gasperin, C., Maziero, E., and Aluisio, S. M. (2010). “Challenging choices for text simplification,” in Computational Processing of the Portuguese Language. Proceedings of the 9th International Conference, PROPOR 2010 (Porto Alegre), 40–50.

Goto, I., Tanaka, H., and Kumano, T. (2015). “Japanese news simplification: task design, data set construction, and analysis of simplified text,” in Proceedings of MT Summit XV (Miami, FL), 17–31.

Guo, H., Pasunuru, R., and Bansal, M. (2018). “Dynamic multi-level multi-task learning for sentence simplification,” in Proceedings of the 27th International Conference on Computational Linguistics (Santa Fe, NM: Association for Computational Linguistics), 462–476.

Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A., et al. (2018). “The sockeye neural machine translation toolkit at AMTA 2018,” in Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (AMTA), Vol. 1, 200–207.

Hwang, W., Hajishirzi, H., Ostendorf, M., and Wu, W. (2015). “Aligning sentences from standard Wikipedia to Simple Wikipedia,” in Proceedings of NAACL-HLT (Denver, CO), 211–217.

Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., et al. (2017). Google's multilingual neural machine translation system: enabling zero-shot translation. Trans. Assoc. Comput. Linguist. 5, 339–351. doi: 10.1162/tacl_a_00065

Klaper, D., Ebling, S., and Volk, M. (2013). “Building a German/Simple German parallel corpus for automatic text simplification,” in ACL Workshop on Predicting and Improving Text Readability for Target Reader Populations (Sofia), 11–19.

Klerke, S., and Søgaard, A. (2012). “DSim, a Danish parallel corpus for text simplification,” in Proceedings of the Language Resources and Evaluation Conference (LREC) (Istanbul), 4015–4018.

Kobus, C., Crego, J., and Senellart, J. (2017). “Domain control for neural machine translation,” in Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017 (Varna: INCOMA Ltd.), 372–378.

Kumar, D., Mou, L., Golab, L., and Vechtomova, O. (2020). “Iterative edit-based unsupervised sentence simplification,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, eds D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Association for Computational Linguistics), 7918–7928.

Laban, P., Schnabel, T., Bennett, P., and Hearst, M. A. (2021). “Keep it simple: unsupervised simplification of multi-paragraph text,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Association for Computational Linguistics), 6365–6378.

Maaß, C. (2020). Easy Language – Plain Language – Easy Language Plus. Balancing Comprehensibility and Acceptability. Easy – Plain – Accessible, Vol. 3 (Berlin: Frank & Timme).

Nikolov, N. I., and Hahnloser, R. (2019). “Large-scale hierarchical alignment for data-driven text rewriting,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019) (Varna: INCOMA Ltd.), 844–853.

Nishihara, D., Kajiwara, T., and Arase, Y. (2019). “Controllable text simplification with lexical constraint loss,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop (Florence: Association for Computational Linguistics), 260–266.

Nisioi, S., Štajner, S., Ponzetto, S. P., and Dinu, L. P. (2017). “Exploring neural text simplification models,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Vancouver, BC), 85–91.

Ogden, C. K. (1944). Basic English: A General Introduction With Rules and Grammar, Vol. 29, ed K. Paul (London: Trench, Trubner).

Paetzold, G., Alva-Manchego, F., and Specia, L. (2017). “MASSAlign: alignment and annotation of comparable documents,” in Proceedings of the IJCNLP 2017, Taipei, Taiwan, November 27 - December 1, 2017, System Demonstrations, eds S. Park and T. Supnithi (Taipei: Association for Computational Linguistics), 1–4.

Palmero Aprosio, A., Tonelli, S., Turchi, M., Negri, M., and Di Gangi, M. A. (2019). “Neural text simplification in low-resource conditions using weak supervision,” in Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation (Minneapolis, MN: Association for Computational Linguistics), 37–44.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (Philadelphia, PA), 311–318.

Reimers, N., and Gurevych, I. (2020). “Making monolingual sentence embeddings multilingual using knowledge distillation,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (Punta Cana: Association for Computational Linguistics), 4512–4525.

Saggion, H. (2017). Automatic Text Simplification (San Rafael, CA: Morgan & Claypool Publishers).

Saggion, H., Štajner, S., Bott, S., Mille, S., Rello, L., and Drndarević, B. (2015). Making it Simplext: implementation and evaluation of a text simplification system for Spanish. ACM Trans. Accessible Comput. (TACCESS) 6:14.

Säuberli, A., Ebling, S., and Volk, M. (2020). “Benchmarking data-driven automatic text simplification for German,” in Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI) (Marseille: European Language Resources Association), 41–48.

Siddharthan, A. (2002). “An architecture for a text simplification system,” in Proceedings of the Language Engineering Conference (Hyderabad), 64–71.

Specia, L. (2010). “Translating from complex to simplified sentences,” in Computational Processing of the Portuguese Language. Proceedings of the 9th International Conference, PROPOR 2010 (Porto Alegre), 30–39.

Spring, N., Pfütze, D., Kostrzewa, M., Battisti, A., Rios, A., and Ebling, S. (2021a). “Comparing sentence alignment methods for automatic simplification of German texts,” in Presentation Given at the 1st International Easy Language Day Conference (IELD) (Germersheim).

Spring, N., Rios, A., and Ebling, S. (2021b). “Exploring German multi-level text simplification,” in Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), 1339–1349.

Štajner, S., Franco-Salvador, M., Rosso, P., and Ponzetto, S. P. (2018). “CATS: a tool for customized alignment of text simplification corpora,” in Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (Miyazaki: European Language Resources Association (ELRA)).

Štajner, S., and Nisioi, S. (2018). “A detailed evaluation of neural sequence-to-sequence models for in-domain and cross-domain text simplification,” in Proceedings of the 11th Language Resources and Evaluation Conference (Miyazaki).

Surya, S., Mishra, A., Laha, A., Jain, P., and Sankaranarayanan, K. (2019). “Unsupervised neural text simplification,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Florence), 2058–2068.

Thompson, B., and Koehn, P. (2019). “Vecalign: improved sentence alignment in linear time and space,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Hong Kong: Association for Computational Linguistics), 1342–1348.

Vajjala, S., and Meurers, D. (2012). “On improving the accuracy of readability classification using insights from second language acquisition,” in Proceedings of the Seventh Workshop on Building Educational Applications Using NLP (NAACL HLT '12) (Montreal, QC), 163–173.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems 30, eds I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Curran Associates, Inc.), 5998–6008.

Wubben, S., van den Bosch, A., and Krahmer, E. (2012). “Sentence simplification by monolingual machine translation,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Jeju Island: Association for Computational Linguistics), 1015–1024.

Xu, W., Callison-Burch, C., and Napoles, C. (2015). Problems in current text simplification research: new data can help. Trans. Assoc. Comput. Linguist. 3, 283–297. doi: 10.1162/tacl_a_00139

Xu, W., Napoles, C., Pavlick, E., Chen, Q., and Callison-Burch, C. (2016). Optimizing statistical machine translation for text simplification. Trans. Assoc. Comput. Linguist. 4, 401–415. doi: 10.1162/tacl_a_00107

Zhu, Z., Bernhard, D., and Gurevych, I. (2010). “A monolingual tree-based translation model for sentence simplification,” in Proceedings of the International Conference on Computational Linguistics (Beijing), 1353–1361.

Keywords: simplified language, automatic text simplification, automatic sentence alignment, sentence alignment gold standard, German, neural machine translation

Citation: Ebling S, Battisti A, Kostrzewa M, Pfütze D, Rios A, Säuberli A and Spring N (2022) Automatic Text Simplification for German. Front. Commun. 7:706718. doi: 10.3389/fcomm.2022.706718

Received: 07 May 2021; Accepted: 18 January 2022;
Published: 23 February 2022.

Edited by:

Andras Kornai, Computer and Automation Research Institute (MTA), Hungary

Reviewed by:

Horacio Saggion, Pompeu Fabra University, Spain
Julia Fuchs, Laboratoire STL - Savoirs, Textes et Langage (UMR 8163), France

Copyright © 2022 Ebling, Battisti, Kostrzewa, Pfütze, Rios, Säuberli and Spring. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Sarah Ebling, ebling@cl.uzh.ch
