What is this website all about?

This website is part of my PhD research. Its purpose and nature are explained in this paper. Eventually I'll put all the contextual info here.

I get "Sorry, invalid email."

Please just email me (marzagao dot 1 at osu dot edu) and I'll add your email address right away (for now only pre-approved email addresses are valid, for cost reasons).

What happens after I hit the "submit" button?

Once you click "submit" a Python script checks whether your input is valid (whether your email is valid and you haven't submitted too many times). That happens on Google App Engine, where this website is hosted. If your input is valid that same Python script requests a remote computer from the Amazon Elastic Compute Cloud (Amazon EC2) and sends your input to it. On that remote machine another Python script checks whether your reference scores are valid (i.e., whether there is at least one reference case and one virgin case), downloads the necessary data (which is stored on the Amazon Simple Storage Service, Amazon S3), does all the computations, and emails the results to you, after which the machine self-terminates.
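In code, the server-side checks boil down to something like this (a minimal sketch; the function name, error messages, and rate limit are made up for illustration, and the real scripts are not shown here):

```python
import re

def validate_submission(email, scores, approved_emails,
                        max_submissions, submission_counts):
    """Illustrative sketch of the checks run before a machine is launched.

    `scores` maps country names to a reference score, or to None for
    virgin cases (cases you want scored).
    """
    # Email must be well-formed and on the pre-approved list.
    if not re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        return "Sorry, invalid email."
    if email not in approved_emails:
        return "Sorry, invalid email."
    # Rate limit: reject users who have submitted too many times.
    if submission_counts.get(email, 0) >= max_submissions:
        return "Sorry, too many submissions."
    # There must be at least one reference case and one virgin case.
    if not any(s is not None for s in scores.values()):
        return "Sorry, no reference cases."
    if not any(s is None for s in scores.values()):
        return "Sorry, no virgin cases."
    return "OK"
```

Only when every check passes does the App Engine script go on to request the EC2 machine; the reference/virgin check shown last actually runs on the remote machine, as described above.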

Here's a flowchart.

In case you are curious, your results are produced by an 8-core Intel Xeon E5-2670v2 CPU with 61GB of RAM, running Linux. Every user submission fires up one of these machines, so the submissions are processed independently; they don't communicate or interact in any way (that prevents mess-ups and makes coding simpler).

I want to do everything myself. Where is the data?

The data (200GB) is stored on Amazon S3. Amazon would charge me "out-of-region" download fees ($0.12/GB) if I gave you the links directly, so email me instead and we'll figure out a cheaper way to get the data to you.

Entering one number at a time is a pain.

I know. I want to replace that table with an Excel-like spreadsheet, so that you can simply copy and paste your reference scores straight from Excel. I will also have user accounts and cookies, so that you can pre-populate the table with your previous submission. But I need to finish my PhD first.

Why does it take so long to run?

Three reasons: dataset size, throughput, and budgetary constraints.

Reason #1: dataset size

The Automated Democracy Scores are based on 42 million newspaper and magazine articles (see paper for details). Parsed and transformed into term-frequency matrices, that's 24 billion data points and 200GB of disk space. I can't feed all that into memory at once. Instead I need to slice the data and process one chunk at a time. And that's a lot of chunks.
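The chunking logic looks roughly like this (an illustrative sketch; the chunk size and the per-chunk computation are made up, and the real pipeline streams chunks from disk rather than slicing an in-memory array):

```python
import numpy as np

def process_in_chunks(matrix, chunk_rows=1000):
    """Process a term-frequency matrix one slice of rows at a time,
    so only one chunk ever has to fit in memory."""
    n_rows = matrix.shape[0]
    totals = np.zeros(matrix.shape[1])
    for start in range(0, n_rows, chunk_rows):
        chunk = matrix[start:start + chunk_rows]  # one slice of the data
        totals += chunk.sum(axis=0)               # example per-chunk computation
    return totals
```

With 42 million documents, even generously sized chunks mean tens of thousands of slices, each of which has to be loaded, processed, and discarded in turn.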

Reason #2: throughput

The data is stored on one platform (Amazon S3), but the computer that runs the thing is on another platform (Amazon EC2). Since the data and the computer are physically separated we need to move the data around before using it. The throughput is actually pretty high - it all happens much faster than whatever your download speed is at home - but still, it's 200GB of data.
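As a rough back-of-the-envelope (the 200MB/s rate below is just an assumption for illustration, not a measured S3-to-EC2 figure):

```python
# How long does moving 200GB take at an assumed sustained rate?
data_gb = 200
rate_mb_per_s = 200          # assumed S3-to-EC2 throughput, for illustration
seconds = data_gb * 1024 / rate_mb_per_s
minutes = round(seconds / 60)   # ~17 minutes just to move the data
```

Even at speeds far beyond a home connection, the transfer alone eats a noticeable slice of the total runtime before any computation starts.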

Reason #3: budgetary constraints

Runtime could be brought down from hours to minutes or seconds. Right now your results are produced by a single computer - a remote Amazon machine that launches when you hit "submit" and self-terminates when your results are emailed. I could instead divide the work across N machines and thus be done N times faster. In theory the cost should be the same: 1 machine working N hours costs the same as N machines working 1 hour.

But in practice that's not so simple. Amazon charges me a full hour every time I start a new machine - even if I only use it for a millisecond. And to bring runtime down to minutes or seconds I would need to fire up A LOT of machines. To bring runtime down to ten seconds, for instance, I would need to fire up some 2520 machines. The type of machine I'm using costs $1.705/hour, so firing up 2520 machines - even for a millisecond - would cost $4297.
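The arithmetic behind those numbers (the ~7-hour single-machine runtime is implied by the figures in the text: 2520 machines x 10 seconds = 7 hours of total work):

```python
# Worked numbers from the text above.
single_machine_hours = 7            # implied: 2520 machines * 10s = 25,200s
target_seconds = 10
machines_needed = single_machine_hours * 3600 // target_seconds
price_per_hour = 1.705              # Amazon bills each machine a full hour
total_cost = machines_needed * price_per_hour
# machines_needed -> 2520, total_cost -> ~$4297
```

Because each machine is billed a full hour no matter how briefly it runs, the cost scales with the number of machines, not with the work actually done.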

And that's ignoring the time it takes to fire up the machines in the first place. That often takes a couple of minutes, so even with $4297 per use I wouldn't be able to bring runtime down to seconds. To achieve seconds I would need the machines to be up and running 24/7 - at a cost of $4297 per hour times the total number of potential users. Needless to say, that is not going to happen.

Granted, I could bring runtime down to about 1 hour while keeping total cost the same. But the more machines you rely on, the higher the probability of an error happening at some point. Sometimes you request a machine but the launch fails. Sometimes a connection error happens and messes up communication between the machines. And so on. Also, the code gets more complex, which itself increases the probability of errors. Bringing runtime down to 1 hour probably doesn't justify the risk.
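To see why more machines means more risk, suppose each machine independently fails with some small probability (the 1% figure below is purely illustrative, not a measured failure rate):

```python
# Probability that at least one of n machines hits an error,
# assuming each fails independently with probability p.
p = 0.01                     # illustrative per-machine failure rate
def p_any_failure(n):
    return 1 - (1 - p) ** n
# 1 machine:   ~1% chance something goes wrong
# 7 machines:  ~7%
# 100 machines: ~63%
```

Even with very reliable machines, the chance of at least one launch failure or connection error grows quickly with the size of the fleet.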

In the future I intend to migrate from Amazon EC2 to Google Compute Engine (not to be confused with Google App Engine, which is what I use to host the front-end). Google Compute Engine charges by the minute, not by the hour, so things should improve greatly. But I would need to rewrite a lot of code and I need to finish my PhD first.