New computer, new plans

I am now the proud(?) owner of a brand new ThinkStation. Nice, fast box, and loaded up with dev and analysis goodies.

Based on discussion with Dr. Hurst, I’m going to look into using Mechanical Turk as a way of gathering large amounts of biometric data. Basically, the goal is to have Turkers log in and type training text. The question is how. And that question has two parts – how to interface with the Turk system and how to get the best results. I’ll be researching these over the next week or so, but I thought I’d put down some initial thoughts here first.

I think a good way to get data would be to write a simple game that presents words or phrases to a user and has them type those words back into the system. Points are given for speed (up to 10 seconds per word?) and accuracy (edit distance from the word). Points are converted into cash somehow using the Turk API?
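The scoring idea above could be sketched like this (the point values and the 10-second cap are placeholders taken from the musings, not a settled design):

```java
// Sketch of the proposed scoring: points for speed (capped at 10 s)
// and accuracy (Levenshtein edit distance from the target word).
public class TypingScore {

    // Classic Levenshtein edit distance between target and typed word.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // Speed points: 10 minus one per elapsed second, floored at zero.
    // Accuracy points: 10 minus the edit distance, floored at zero.
    static int score(String target, String typed, double seconds) {
        int speed = Math.max(0, 10 - (int) seconds);
        int accuracy = Math.max(0, 10 - editDistance(target, typed));
        return speed + accuracy;
    }

    public static void main(String[] args) {
        System.out.println(score("because", "becuase", 3.2)); // fast but transposed
    }
}
```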

The list of words should be the 100 most-used English words? Words/phrases are presented randomly. There should be some kind of upper and lower limit on the words a user can enter so the system is not abused. In addition, IP address/browser info can be checked as a rough cull for repeat users.
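The rough cull could start as simple as this sketch. Keying on IP plus user-agent is only a coarse filter (shared IPs and NAT will collide), and the names here are mine, not anything from the app:

```java
import java.util.HashSet;
import java.util.Set;

// Rough repeat-user cull: remember ip + user-agent pairs we've seen.
// A coarse filter only, not real identity checking.
public class RepeatFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a given ip/browser pair is seen.
    public boolean firstVisit(String ip, String userAgent) {
        return seen.add(ip + "|" + userAgent);
    }
}
```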

Ok, I’m now the proud owner of an AWS account and a Turk Requestor account. I’ve also found the Amazon Mechanical Turk Getting Started Guide, though the Kindle download is extremely slow this Christmas Day.

Setting up sandbox accounts. Some things appear to time out. Not sure if I’ll be able to use the API directly. I may have to do a check against the session uid via cut-and-paste.

Set up Admin, dev and read-only IAM user accounts.

Accessed the production and sandbox accounts with the command-line tools. Since I already had a JRE, I just downloaded the install directory. You need to create an MTURK_CMD_HOME environment variable that points to the root of your Turk install; in my case, ‘C:\TurkTools\aws-mturk-clt-1.3.1’. Do not add this value to your path – it makes Java throw a fit. The other undocumented thing you must do is change the service-url values for the accounts from http to https.
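For reference, the change lives in each account’s .properties file in the CLT directory; after the edit the line looks something like this (the URL shown is illustrative – the fix is just flipping the scheme on whatever URL your file already has):

```properties
# mturk.properties (and the sandbox equivalent): the service url must
# be https, not http, or the command-line tools fail.
service_url=https://mechanicalturk.amazonaws.com/?Service=AWSMechanicalTurkRequester
```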

To log in successfully, I was not able to use the IAM accounts and had to use a root key. And sure enough, when I looked over the AWS Request Authentication page, there it was: “Amazon Mechanical Turk does not use AWS Identity and Access Management (IAM) credentials.” Sigh. But that’s 4 tabs I can close and ignore.

Setting up the Java project for the HIT tutorial.

  • Since I’ve been using IntelliJ for all my JavaScript and PHP, I thought I’d see how it is with Java. The first confusion was how to grab all the libraries/jar files that Turk needs. To add items, use File->Project Structure. This brings up a ‘Project Structure’ dialog. Pick Modules under Project Settings, then click the Sources/Paths/Dependencies tab. Click the green ‘+’ on the far right-hand side, then select Jar or Directories. It doesn’t seem to recurse down the tree, but you can shift-click to select multiple one-level directories. This should then populate the selected folders in the list that has the Java and <Module Source> entries. Once you hit OK at the bottom of the dialog, you can verify that the jar files are listed under the External Libraries heading in the Project panel.
  • Needed to put the mturk.properties file at the root of the project, since that’s where System.getProperty(“user.dir”) said it should be.

Success! Sent a task, logged in, and performed the task in the sandbox. Now I have to see how to get the results and such.

And the API information on Amazon makes no sense. It looks like the Java API is not actually built by Amazon, the only true access is through SOAP/REST. The Java API is in the following locations:

If you download the zip file, the javadoc API is available in the docs directory, and there appear to be well-commented samples in the samples directory.

Took parts from the simple_survey.java and reviewer.java examples and have managed to post and approve. Need to see if already approved values can be retrieved. If they can be, then I can check against previous uids. Then I could either pull down the session_ids as a list or create a new PHP page that returns the session_ids restfully. Kinda like that.

The PHP page that returns session_ids is done and up: http://philfeldman.com/iRevApps/dbREST.php. I’ve now got Java code running that will pull down the entire XML document and search through it, looking for a result that I can then match against the submitted values. I need to check that the value isn’t already approved, which means going through the approved items first to look for the sessionIDs.
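A minimal sketch of that matching step, assuming the endpoint returns elements named session_id (the actual tag name dbREST.php emits may differ):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Pull every session id out of the XML document so submitted values
// can be checked against the list.
public class SessionIds {

    static List<String> parseSessionIds(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("session_id");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }
}
```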

And, of course, there’s a new wrinkle. A worker can only access a HIT once, which has led me to load up quite a few HITs. And I’ve noticed that you can’t search across multiple HITs, so I need to keep track of them and fill the arrays from multiple sources. Or change the db to have an ‘approved’ flag, which I don’t want to do if I don’t have to. I’d rather start with keeping all the HIT ids and iterating over them. If that fails, I’ll make another RESTful interface to the db that will have a list of approved session_ids.
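The keep-all-the-HIT-ids plan might look like this sketch; the fetch step is hidden behind a hypothetical interface, since the real call goes through the Turk client, and this only shows the bookkeeping:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Track every HIT id as it is created, then iterate over all of them
// to gather results, since one HIT can't be searched across.
public class HitTracker {

    // Placeholder for the real MTurk result-fetching call.
    interface ResultFetcher {
        List<String> sessionIdsFor(String hitId);
    }

    // LinkedHashMap preserves insertion (i.e. creation) order.
    private final Map<String, Boolean> hits = new LinkedHashMap<>();

    public void track(String hitId) { hits.put(hitId, Boolean.TRUE); }

    // Fill one array from multiple HIT sources, in creation order.
    public List<String> allSessionIds(ResultFetcher fetcher) {
        List<String> all = new ArrayList<>();
        for (String hitId : hits.keySet()) {
            all.addAll(fetcher.sessionIdsFor(hitId));
        }
        return all;
    }
}
```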

Iterating over the HITs in order of creation seems to work fine. At least good enough for now.

Bought and set up tajour.com. Got a little nervous about pointing a bunch of Turkers at philfeldman.com. I will need to make it https after a while too. Will need to move over the JavaScript files and create a new turk page.

Starting on the training wizard directive. While looking around for a source of training text I initially tried vocabulary building sites but wound up at famous quotations, at least for the moment. Here’s one that generates random quotes.

The wizard is done, along with its initial implementation as my turk data input page. Currently, the pages are:

  • irev3 – the main app
  • irevTurk – the mechanical turk data collection
  • irevdb – data pulls and formatting from the db.

And I appear to have turk data!

First Draft?

I think I’ve made enough progress with the coding to have something useful. And everything is using the AngularJS framework, so I’m pretty buzzword-compliant. Well, that may be too bold a statement, but at least it can produce data that can be analyzed by something a bit more rigorous than just looking at it in a spreadsheet.

Here’s the current state of things:

The data analysis app: http://tajour.com/iRevApps/irevdb.html

Changes:
  • Improved the UX on both apps. The main app first.
    • You can now look at other posters’ posts without ‘logging in’. You have to type the passphrase to add anything, though.
    • It’s now possible to search through the posts by search term and date
    • The code is now more modular and maintainable (I know, not that academic, but it makes me happy)
    • Twitter and Facebook crossposting are coming.
  • For the db app
    • Dropdown selection of common queries
    • Tab selection of query output
    • Rule-based parsing (currently keyDownUp, keyDownDown, and word)
    • Excel-ready CSV output for all rules
    • WEKA-ready output for keyDownUp, keyDownDown, and word. A caveat on this: the WEKA ARFF format wants to have all session information in a single row. This has two ramifications:
      • There has to be a column for every key/word, including the misspellings. For the training task it’s not so bad, but for the free form text it means there are going to be a lot of columns. WEKA has a marker ‘?’ for missing data, so I’m going to start with that, but it may be that the data will have to be ‘cleaned’ by deleting uncommon words.
      • Since there is only one column per key/word, keys and words that are typed multiple times have to be grouped somehow. Right now I’m averaging, but that loses a lot of information. I may add a standard deviation measure, but that will mean doubling the columns. Something to ponder.
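The flattening described above, as a sketch: one column per key/word, repeated timings averaged, and WEKA’s ‘?’ for columns a session never produced. The input shape here is illustrative, not the app’s actual schema:

```java
import java.util.List;
import java.util.Map;

// Build one ARFF data row for a session: every column must appear,
// missing ones as '?', repeated measurements averaged.
public class ArffRow {

    // columns: every key/word seen across all sessions
    // timings: this session's key/word -> list of measurements (e.g. ms)
    static String row(List<String> columns, Map<String, List<Double>> timings) {
        StringBuilder sb = new StringBuilder();
        for (String col : columns) {
            if (sb.length() > 0) sb.append(',');
            List<Double> vals = timings.get(col);
            if (vals == null || vals.isEmpty()) {
                sb.append('?');                  // WEKA missing-value marker
            } else {
                double sum = 0;
                for (double v : vals) sum += v;
                sb.append(sum / vals.size());    // average repeated entries
            }
        }
        return sb.toString();
    }
}
```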

Lastly, Larry Sanger (co-founder of Wikipedia) has started a wiki-ish news site. It’s possible that I could piggyback on this effort, or at least use some of their ideas/code. It’s called infobitt.com. There’s a good manifesto here.

Normally, I would be able to start analyzing data now, with WEKA and SPSS (which I bought/leased about a week ago), but my home dev computer died and I’m waiting for a replacement right now. Frustrating.