Trigraphs and digraphs and milliseconds oh my!

I’ve been reading my papers on recognizing biometrics from keystroke info, and the approaches generally seem to work either by training neural nets or by examining the timing of certain letter patterns, particularly digraphs and trigraphs. I’m currently working on parsing the raw data into something more manageable that can be stored in a form-specific table:

CREATE TABLE IF NOT EXISTS `trigraph_table` (
  `uid` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(255) NOT NULL,
  `word` varchar(255) NOT NULL,
  `milliseconds` int(11) NOT NULL,
  PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;
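
As a rough illustration of the parsing step, here is a minimal Python sketch that reduces raw keystrokes to timed digraphs or trigraphs. The raw format (one character plus a key-down time in milliseconds per keystroke) and the name ngraph_timings are assumptions made for the example, not the actual format stored in the raw column. Each result pair would map onto the word and milliseconds columns above.

```python
from typing import List, Tuple

# Hypothetical raw format: one (character, key-down time in ms) pair per keystroke.
Keystroke = Tuple[str, int]


def ngraph_timings(keystrokes: List[Keystroke], n: int = 3) -> List[Tuple[str, int]]:
    """Return (ngraph, elapsed_ms) for every run of n consecutive letters."""
    rows = []
    for i in range(len(keystrokes) - n + 1):
        window = keystrokes[i:i + n]
        chars = "".join(ch for ch, _ in window)
        # Skip windows that contain spaces, digits or punctuation,
        # so an n-graph never straddles a word boundary.
        if not chars.isalpha():
            continue
        # Elapsed time from the first key-down to the last key-down in the window.
        elapsed = window[-1][1] - window[0][1]
        rows.append((chars.lower(), elapsed))
    return rows


sample = [("t", 0), ("h", 110), ("e", 205), (" ", 330),
          ("c", 480), ("a", 560), ("t", 655)]
trigraphs = ngraph_timings(sample, n=3)
digraphs = ngraph_timings(sample, n=2)
print(trigraphs)
print(digraphs)
```

Measuring from first key-down to last key-down is only one possible definition of the interval; key-up times or per-key hold times would need a richer raw format.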

I can point back to the session_table for all the associated data if there is not sufficient clustering with just the session_id. Speaking of the session_table, it’s getting downright scary:

CREATE TABLE IF NOT EXISTS `session_table` (
  `uid` int(11) NOT NULL AUTO_INCREMENT,
  `session_id` varchar(255) NOT NULL,
  `type` int(11) NOT NULL,
  `entry_time` datetime NOT NULL,
  `ip_address` varchar(255) NOT NULL,
  `browser` varchar(255) NOT NULL,
  `referrer` varchar(255) NOT NULL,
  `submitted_text` text NOT NULL,
  `raw` MEDIUMTEXT NOT NULL,
  `parent_session_id` varchar(255) DEFAULT NULL,
  `veracity` int(11) NOT NULL,
  `hostname` varchar(255) NOT NULL,
  `city` varchar(255) NOT NULL,
  `region` varchar(255) NOT NULL,
  `country` varchar(255) NOT NULL,
  `latlong` varchar(255) NOT NULL,
  `service_provider` varchar(255) NOT NULL,
  `postal` varchar(255) NOT NULL,
  PRIMARY KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

The idea that we are anonymous is just silly: from the moment we connect to a server, we’re basically exposed. And I’m just getting started with gathering data. Imagine what experienced web developers can do, particularly once cookies are enabled. Minimising the exposure of someone who posts to the site is probably on the list of things to look at. Maybe server hardening? Certainly no links from the page.

On a different note, I’ve been thinking a bit about how confident we need to be of a source before we can start determining information trustworthiness. It seems to me that if we can show that the information comes from a large number of individuals, that is useful information in itself, even if we can’t reliably distinguish any one of them. That approach does break down for the case of a single individual reporting a unique scoop, though.

Another thing that’s kind of interesting is typos and spell check. Keeping track of typos is easy – just run everything through a dictionary. A different, also potentially useful thing is to look at the differences between the words produced by the keystrokes and the submitted text. Where those differ, some sort of spell correction was used. It doesn’t behave like a paste, or it would already be trapped. Anyway, it’s another form of interesting data.
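
A minimal Python sketch of that comparison, under stated assumptions: the key events are already decoded into single characters plus a "BACKSPACE" marker (a hypothetical encoding, not the real raw format), and the function names are made up for the example. It replays the keystrokes, then uses difflib from the standard library to line up the typed words against the submitted words and report the ones that were replaced.

```python
import difflib


def replay_keystrokes(keys):
    """Rebuild the typed text; 'BACKSPACE' removes the previous character."""
    buf = []
    for key in keys:
        if key == "BACKSPACE":
            if buf:
                buf.pop()
        elif len(key) == 1:
            buf.append(key)
        # Any other multi-character key name (e.g. 'SHIFT') is ignored here.
    return "".join(buf)


def corrected_words(keys, submitted_text):
    """Return (typed, submitted) pairs where the submitted words differ."""
    typed = replay_keystrokes(keys).split()
    submitted = submitted_text.split()
    matcher = difflib.SequenceMatcher(None, typed, submitted)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'replace' means a run of typed words was swapped for other words.
        if tag == "replace":
            pairs.append((" ".join(typed[i1:i2]), " ".join(submitted[j1:j2])))
    return pairs


keys = list("teh cat")
fixes = corrected_words(keys, "the cat")
print(fixes)
```

The plain typo count from the paragraph above would be a separate pass: checking each replayed word against a dictionary word list.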

 

Safe(r) Data and Nonexistent Functions

If you want to reduce the likelihood of a SQL injection attack, use precompiled (prepared) queries. Nice in theory, tougher in practice. The nub of the problem appears to be the way that PHP binds data to execute the insert or the pull. With a nice, vulnerable query you can assemble the SQL with string manipulation functions, which makes it easy to write nice, general functions. However, someone mean can add something like “;DROP TABLE students;” to their input and poof, the table students is gone. Now, there should be a nice call that returns everything as an associative array, but that doesn’t seem to be reliable across PHP installations, so we need to work with the much more restrictive fetch().
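
To make the difference concrete, here is a small sketch of the two query styles. It uses Python's sqlite3 module with an in-memory database as a stand-in, not the PHP and MySQL code discussed here; the students table and the hostile input are made up for the example. The vulnerable string is only built, never executed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")
conn.execute("INSERT INTO students (name) VALUES ('Alice')")

# Hypothetical hostile input typed into a form field.
hostile = "x'; DROP TABLE students; --"

# Vulnerable pattern: the input becomes part of the SQL text itself,
# so the DROP TABLE ends up inside the statement sent to the database.
unsafe_sql = "SELECT name FROM students WHERE name = '" + hostile + "'"

# Parameterized pattern: the SQL text is fixed and the value is bound
# separately, so the input is only ever compared as a string.
rows = conn.execute(
    "SELECT name FROM students WHERE name = ?", (hostile,)
).fetchall()

# The table is untouched after the parameterized query.
count = conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]
print(unsafe_sql)
print(rows, count)
```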

Things to remember:

  • Everything has to happen when the statement is available, between prepare() and close().
  • Use bind_param(String datatypes…) to send data and bind_result for returning data. bind_param is less picky – you can access elements of an array directly. For bind_result you have to have individual variables declared.
  • When things go wrong in the PHP mysql code, it is likely that an HTML table describing the error will be returned instead of the expected data. That case needs to be handled.
  • Stringifying objects into JSON and parsing them back out may or may not preserve nested hierarchies. Watch what goes on in the debugger.
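
The points above can be sketched in one place. This is again a Python and sqlite3 stand-in rather than the actual PHP middleware, and the function name fetch_trigraphs and the {"ok": ...} response shape are assumptions made for the example. It does all its work while the cursor is open, builds the associative rows by hand from a row-at-a-time fetch, and turns database errors into JSON so the client never has to parse an unexpected error page.

```python
import json
import sqlite3


def fetch_trigraphs(conn, session_id):
    """Return a JSON string: rows as dicts on success, an error object on failure."""
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT word, milliseconds FROM trigraph_table WHERE session_id = ?",
            (session_id,),
        )
        # Column names come from the open cursor, so read them before closing it.
        columns = [d[0] for d in cur.description]
        rows = []
        while True:
            row = cur.fetchone()  # one row at a time, like a restrictive fetch()
            if row is None:
                break
            rows.append(dict(zip(columns, row)))
        cur.close()
        return json.dumps({"ok": True, "rows": rows})
    except sqlite3.Error as exc:
        # Report the failure in the same JSON shape the client already parses.
        return json.dumps({"ok": False, "error": str(exc)})


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trigraph_table (uid INTEGER PRIMARY KEY, "
    "session_id TEXT, word TEXT, milliseconds INTEGER)"
)
conn.execute(
    "INSERT INTO trigraph_table (session_id, word, milliseconds) VALUES (?, ?, ?)",
    ("abc", "the", 205),
)
good = json.loads(fetch_trigraphs(conn, "abc"))

empty_conn = sqlite3.connect(":memory:")  # no table here, so the query fails
bad = json.loads(fetch_trigraphs(empty_conn, "abc"))
print(good)
print(bad["ok"])
```

On the client side, checking the ok flag before touching rows keeps the error path and the data path separate.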

Anyway, that just about doubled the line count in the middleware and tied the PHP code much more tightly to the form of the database. That being said, this is intended to have some production values in it anyway, so that may be a good thing. The new and improved results are in the same old place, namely io2.html. Next comes the integration of all that DB work, the recognizer part, and the panel part.