Computational Representation of Chinese Characters: Comparison Between Singular Value Decomposition and Variational Autoencoder

Year: 2020
Volume: 21, Issue: 1
Pages: 141-160
Authors: Yu-Hsiang Tseng, Shu-Kai Hsieh
Abstract
Writing is a notoriously complex problem and is generally decomposed into a series of subtasks: idea generation, expression, revision, etc. Given some goal, the author generates a set of ideas (brainstorming) and integrates them into a skeleton (outline, text plan). This leads to a first draft, which is then submitted for revision, possibly yielding changes at various levels (content, structure, form): having produced a draft, authors typically revise, edit, and proofread their documents. We confine ourselves here to academic writing, focusing on sentence production. While there has been considerable work on this topic, most writing assistance has dealt with editing and proofreading, the goal being the correction of surface-level problems such as typography, spelling, or grammatical errors. We broaden the scope by also including cases where the entire sentence needs to be rewritten in order to properly express all of the planned information. Hence, Sentence-level Revision (SentRev) becomes part of our writing-assistance task. Systems performing well on this task can be of considerable help to inexperienced authors by producing fluent, well-formed sentences based on the user's drafts. In order to evaluate our SentRev model, we have built a new, freely available crowdsourced evaluation dataset consisting of incomplete sentences produced by non-native writers paired with the final versions of those sentences extracted from published academic papers. We also used this dataset to establish baseline performance on SentRev.

Keywords: Chinese characters, eigencharacter, orthography, singular value decomposition, variational autoencoder