final results

Milestone date: 29-Apr-2009
This milestone focused on the creation of a hi-fi prototype to test some of the ideas described in previous milestones.

process

Realizing that, in this case, a hi-fi prototype would be most useful if it involved creating the entire infrastructure (from data to front-end) that the application requires, I got to work on the steps required to create such a visualization. The steps included:

  1. Extracting data from the ThML files.
  2. Normalizing, cleaning, and filling in the data.
  3. Building and populating a relational database to support the application.
  4. Establishing a middle-tier structure between the database and the visualization.
  5. Creating the visualization itself.

One way of characterizing each of these steps is to take an in-depth look at the technologies and techniques involved:

  1. Starting with 867 ThML files, I extracted the data using Perl and XPath expressions (using the XML::XPath module obtained from CPAN). Some of this data (e.g. author information) was exported to Comma Separated Values (CSV) files for manual cleaning.
  2. These CSV files were loaded into a spreadsheet editor (I won't say which one), which allowed me to quickly identify unique (and duplicate) authors for each ThML file as well as standardize spelling. I also used Google and Wikipedia to look up the birth and death years of authors when this information was not available in the ThML files. I constructed a CSV file describing the Bible using information found online (from Deaf Missions) and the Open Scripture Information Standard (OSIS) abbreviations for scripture markup.
  3. Once the data was extracted and cleaned, I designed a normalized database schema for the data and wrote a Perl script to insert the data into a SQLite database.
  4. This 48MB database was uploaded to a server, where I used PHP scripts to access the database and create value objects. These server-side objects were accessed using the AMFPHP framework, an open-source framework that works with Flash Remoting (built into Adobe Flex) to enable data access from the Flash Player.
  5. Writing ActionScript and MXML code using Flex Builder, I designed and implemented layout algorithms for my main visualization as well as the author timeline visualization. I then implemented the various interactions that the prototype supports.
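
As a rough illustration of steps 1 and 3 above, the sketch below extracts scripture references from one document and loads them into a normalized SQLite database. It uses Python instead of the Perl and XML::XPath that were actually used. The sample markup, the element and attribute names, and the table names are all assumptions made for the example; they are not the real ThML structure or the prototype's schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one ThML file.
SAMPLE = """<ThML>
  <ThML.head><title>Example Text</title><author>Example Author</author></ThML.head>
  <ThML.body>
    <p><scripRef osisRef="Bible:Gen.1.1">Gen. 1:1</scripRef>,
       <scripRef osisRef="Bible:John.3.16">John 3:16</scripRef></p>
    <p><scripRef osisRef="Bible:Gen.2.7">Gen. 2:7</scripRef></p>
  </ThML.body>
</ThML>"""

# Assumed schema: one row per author, per text, and per scripture reference.
SCHEMA = """
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT,
                    author_id INTEGER REFERENCES authors(id));
CREATE TABLE scripture_refs (text_id INTEGER REFERENCES texts(id),
                             book TEXT, chapter INTEGER, verse INTEGER);
"""


def extract(xml_text):
    """Return (title, author, [(book, chapter, verse), ...]) for one document."""
    root = ET.fromstring(xml_text)
    title = root.findtext(".//title")
    author = root.findtext(".//author")
    refs = []
    for el in root.iter("scripRef"):
        # Assumes every reference has the form "Bible:Book.Chapter.Verse".
        osis = el.get("osisRef", "")
        book, chapter, verse = osis.split(":", 1)[-1].split(".")
        refs.append((book, int(chapter), int(verse)))
    return title, author, refs


def load(conn, title, author, refs):
    """Insert one document's data, reusing the author row if it already exists."""
    conn.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (author,))
    author_id = conn.execute(
        "SELECT id FROM authors WHERE name = ?", (author,)
    ).fetchone()[0]
    text_id = conn.execute(
        "INSERT INTO texts (title, author_id) VALUES (?, ?)", (title, author_id)
    ).lastrowid
    conn.executemany(
        "INSERT INTO scripture_refs VALUES (?, ?, ?, ?)",
        [(text_id, *ref) for ref in refs],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, *extract(SAMPLE))
book_counts = conn.execute(
    "SELECT book, COUNT(*) FROM scripture_refs GROUP BY book ORDER BY book"
).fetchall()
print(book_counts)
```

Real ThML files would need more defensive parsing than this, since references can span ranges or omit the verse, and the same data is coded in different ways across documents.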

challenges

Each of these steps carried its own set of obstacles and challenges to overcome; however, some stick out in my mind as particularly troublesome.

thml documents

One of the first challenges relates to the nature of the ThML documents. As mentioned above, some manual cleaning was required in order to get the best data from these documents. Given that these documents are hand-coded by various volunteers, they are remarkably consistent. Even so, the same kind of data was coded in different ways from one document to the next. My Perl extraction scripts had to account for this, which required a large amount of trial and error as well as manual document inspection to ensure appropriate coverage of the documents.

performance

It quickly became apparent that performance was going to be a constant issue in creating this visualization. Various design decisions (e.g. using semi-transparent edges) exacerbate the problem; however, the raw number of references in the CCEL (1,309,906 unique scripture references) alone makes data processing very challenging. I have attempted to optimize the database structure and queries to the best of my abilities, but it still takes over 10 seconds to query, encode, and transfer all of the scripture book references (16,192 book references) for all CCEL texts. Displaying all of these references slows the visualization to a crawl, which is why edges are filtered by default to references that represent at least 10% of a text's scripture references.
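
The default edge filter can be pictured with a minimal sketch like the one below, written in Python for illustration. The function name, the input shape (a mapping from book to reference count for a single text), and the sample counts are assumptions; the text above does not say whether the prototype applies the filter in the database, on the server, or in the Flash client.

```python
def filter_edges(book_counts, min_share=0.10):
    """Keep only the books that account for at least min_share of a text's references.

    book_counts maps a book abbreviation to the number of references that one
    text makes to that book (hypothetical input shape).
    """
    total = sum(book_counts.values())
    if total == 0:
        return {}
    return {
        book: count
        for book, count in book_counts.items()
        if count >= min_share * total
    }


# Made-up counts for one text: 100 references in total.
example = {"Gen": 50, "Ps": 30, "John": 15, "Jude": 5}
visible = filter_edges(example)
print(visible)
```

With these made-up counts, the edge to Jude (5% of the text's references) is dropped while the other three are kept. Applying the same threshold before the data leaves the server would also reduce how much has to be encoded and transferred.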

author timeline

From the start, I knew the author timeline was going to be problematic. In the prototype, it is incomplete: the slider window cannot be resized, there are no labels to indicate the years, and there is no obvious way to stop filtering by author.

[Figure: author timeline]

In my original design, I had intended that the rounded upper corners of the timeline slider be used to resize the slider window. This is still my intent; it just turns out to be pretty difficult to implement.

Another issue comes from the design of my layout algorithm and the reality of the data. My initial thought was to make author lines gravitate toward the center, equally balanced on both sides while preventing overlap. However, the number of authors explodes after 1800, causing this layout strategy to produce a decidedly ugly timeline.

[Figure: slider attempt 1]

To fix this problem, I simply chose to allow overlap when the timeline exceeds some prescribed height. While this is less accurate, I think it still conveys something about the nature of the data while fitting in a compact space.

[Figure: slider attempt 2]
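
One way to picture this kind of layout is the greedy lane assignment sketched below in Python. It is not the prototype's ActionScript algorithm, only an illustration under stated assumptions: authors are processed in order of birth year, each is placed in the free lane closest to the center (lanes alternate above and below it), and once a hypothetical max_lanes cap is reached the author is placed, overlapping, in the lane that frees up soonest. All names and sample life spans are made up.

```python
def lane_offsets(max_lanes):
    """Return lane offsets ordered nearest the center first: 0, 1, -1, 2, -2, ..."""
    offsets = [0]
    k = 1
    while len(offsets) < max_lanes:
        offsets.append(k)
        if len(offsets) < max_lanes:
            offsets.append(-k)
        k += 1
    return offsets


def layout(spans, max_lanes=5):
    """Assign each (name, birth, death) span to a lane offset from the center.

    Overlap is avoided while a free lane exists; once every lane is occupied,
    the span is placed (with overlap) in the lane whose occupant ends earliest.
    """
    offsets = lane_offsets(max_lanes)
    lane_end = {off: None for off in offsets}  # latest death year in each lane
    placement = {}
    for name, birth, death in sorted(spans, key=lambda s: s[1]):
        free = [
            off for off in offsets
            if lane_end[off] is None or lane_end[off] < birth
        ]
        if free:
            lane = free[0]  # offsets are ordered, so this is nearest the center
        else:
            # Height cap reached: accept overlap in the lane that frees up soonest.
            lane = min(offsets, key=lambda off: lane_end[off])
        placement[name] = lane
        previous = lane_end[lane]
        lane_end[lane] = death if previous is None else max(previous, death)
    return placement


# Made-up life spans that all overlap around 1550-1560.
spans = [
    ("A", 1500, 1560),
    ("B", 1520, 1580),
    ("C", 1540, 1600),
    ("D", 1570, 1620),
    ("E", 1550, 1610),
]
capped = layout(spans, max_lanes=3)
uncapped = layout(spans, max_lanes=10)
print(capped)
print(uncapped)
```

With the cap at three lanes, authors D and E share lanes with earlier authors; with a generous cap, E gets a lane of its own and the timeline grows taller. That is the trade-off between accuracy and compactness described above.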

Finally, the author timeline does not present a clear affordance for disabling filtering when the user wants to display texts by all of the authors. It is not apparent to me whether there is an elegant, simple design solution.

future work

In light of these challenges, the way forward is clear. My first priority is to improve performance, because I believe the current sluggishness would be a major turn-off to users. Furthermore, the author timeline component needs to be polished in the ways described above.

There is, of course, significant functionality not included in the prototype. Some must-have features that come to mind include the ability to search by title or author, the ability to select CCEL texts to follow while changing the filtering parameters, and the ability to view scripture references at the chapter and verse levels of aggregation (in addition to the current book level of aggregation).
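
The chapter and verse levels of aggregation could, for instance, be derived from the same reference data by truncating each reference to the desired depth. The Python sketch below assumes references are available as OSIS-style strings of the form Book.Chapter.Verse; the function name and sample references are hypothetical, and nothing here reflects how the prototype stores or queries its data.

```python
from collections import Counter

# Number of dot-separated parts to keep for each aggregation level.
DEPTH = {"book": 1, "chapter": 2, "verse": 3}


def aggregate(osis_refs, level="book"):
    """Count references after truncating each one to the given level."""
    depth = DEPTH[level]
    return Counter(".".join(ref.split(".")[:depth]) for ref in osis_refs)


# Made-up references from a single text.
refs = ["Gen.1.1", "Gen.1.2", "Gen.2.7", "John.3.16"]
by_book = aggregate(refs, "book")
by_chapter = aggregate(refs, "chapter")
print(by_book)
print(by_chapter)
```

Finer levels produce many more distinct targets than the book level, so they would likely make the performance issues described earlier more pressing.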