via natural language processing blog von hal am 11.02.12
I received the following (slightly edited) question from my colleague Jon Katz a few days ago:
I was thinking about the problem of authorship attribution... Have people thought about the flip side of this problem? Namely, "anonymizing" text so that it would be hard to attribute it to any author?
This is something I've actually wondered about in the context of blogging for a while.  I noticed at some point that my "blogger voice" is very similar to my "reviewer voice" and started worrying that I might be too identifiable as a reviewer.  This might either be due to lexical choice ("bajillion" or "awesome") or due to some more subtle stylistic choices.

There is quite a bit of work on authorship attribution.  I think the first time I heard a talk on this topic was on March 24, 2004, when Shlomo Argamon gave a talk at ISI (no, I don't have an amazing memory, I cheated) on "On Writing, Our Selves: Explorations in Stylistic Text Categorization."  The basic hypothesis of the talk, at least as I remember it, was that if you're trying to do authorship attribution, you should throw out content words and focus on things like POS tag sequences, parse tree structures, and things like that.

There's been a lot of subsequent work in this, and related areas.  One very related area is on things like trying to predict demographic information (age, gender, socio-economic status, education level, and, yes, astrological sign) from tweets, blog posts or emails (or other forms).  One of the key distinctions that I think is important in all of this work is whether the original author is intentionally trying to hide information about him or herself.  For instance, someone trying to impersonate Shakespeare, or a child predator pretending to be a different age or gender, or a job applicant trying to sound more educate than is true.  This latter is a much harder problem because the stupid topically stereotypical features that pop out as being indicative (like men talking about "wifes" and "football" and women talking about "husbands" and "yoga") and the silly features that don't really tell us anything interesting (on twitter, apparently men tend to put  "http://" before URLs more than women -- who knew?) because these "pretenders" are going to intentionally try to hide that information (now that everyone knows to hide "http://" to trick gender recognizers!).  It also means that falling back on topic as a surrogate for demography should not work as well.  This seems to be a very different problem from trying to identify whether a blog post is written by me or by Jon, which should be 99.9% do-able by just looking at content words.

The reason I bring this all up is because we don't want to anonymize by changing the topic.  The topic needs to stay the same: we just need to cut out additional identifying information.  So, getting back to Jon's question, the most relevant work that I know of is on text steganography (by Ching-Yun Chang and Stephen Clark), where they use the ability to do paraphrasing to encode messages in text.  Aside from the challenge of making the output actually somewhat grammatical, the basic idea is that when you have two ways of saying the same thing (via paraphases), you can choose the first one to encode a "0" and the second to encode a "1" and then use this to encode a message in seemingly-natural text.

I also remember having a conversation a while ago while a (different) colleague about trying to build a chat system where you could pretend that you're chatting with someone famous (like Obama or Harry Potter or Scooby Doo).  A similar problem is trying to paraphrase my own writing to sound like someone else, but zoinks, that seems hard!  A basic approach would be to build a Scooby Doo language model (SDLM) and then run my blog posts through a paraphrase engine that uses the SDLM for producing the output.  My vague sense is that this would work pretty poorly, primarily because the subtleness in phrase structure selection would be lost on a highly-lexicalized language model.  I imagine you'd get some funny stuff out and it might be amusing to do, but I don't have time to try.

As far as pure anonymization goes, it seems like doing something similar to the steganography approach would work.  Here, what you could do is generate a random sequence of bits, and then "encode" that random sequence using the steganography system.  This would at least remove some identifying information.  But the goal of the steganography isn't to change every phrase, but just to change enough phrases that you can encode your message.  It also wouldn't solve the problem that perhaps you can identifying a bit about an author by the lengths of their sentences.  Or their oscillation between long and short sentences.  This also wouldn't be hidden.

An alternative, human-in-the-loop approach might be simply to have an authorship recognition system running in your word processor, and then any time you type something that enables it to identify you, it could highlight it and you could be tasked with changing it.  I suspect this would be a frustrating, but fairly interesting experience (at least the first time).

p.s., I'm now officially tweeting on @haldaume3.

via Radford Neal's blog von Radford Neal am 11.02.12

I’ve released a new version of my software for Low Density Parity Check (LDPC) codes. These error-correcting codes were invented by Robert Gallager in the early 1960′s, and re-invented and shown to have very good performance by David MacKay and myself in the mid-1990′s. The decoding algorithm for LDPC codes is related to that used for Turbo codes, and to probabilistic inference methods used in other fields. Variations on LDPC and Turbo codes are currently the best practical codes known, in terms of their ability to transmit data at rates approaching channel capacity with very low error probability.

This new version has only a few bug fixes and minor enhancements. The big change is that I’ve put up the source code as a Github repository, that uses the git source code control system. This should make it easier for other people to create their own extensions of the software. The software is available here, and the Github repository is here.

For me, this is also an exercise in learning about git and Github. From my initial experience, git does seem better than the source code control systems I’ve used previously (which are Subversion, CVS, and the “modify” utility of the Kronos operating system for the CDC 6000 series of computers). The help pages are a bit cryptic in places, though, at least for the novice user…


via Radford Neal's blog von Radford Neal am 11.02.12

Click on image for larger version.

Toronto, January 2012. Nikon F3, Micro-Nikkor-P Auto 55mm 1:3.5 lens (AI converted), Black’s (Fuji?) ISO 200 film, Nikon Coolscan V.


The current version of the d8taplex news site provides a mapping between news article authors and their Twitter accounts. In addition, I've been experimenting with mining all the tweets that link to a specific news article. This has lead to a couple of observations.

  1. Reuters contributors with Twitter accounts don't tweet very much. I can quantify this in a later post, but if you look on the d8taplex site at any time, you will rarely see the icon that indicates a journalist has recently tweeted ().
  2. Twitter users who post links to news articles rarely contribute any useful hashtags. The version of the site that I have running locally accumulates all the hashtags associated with an article and they invariably tend to simply be the main nouns in the article title. I was hoping for some sort of magic here but it doesn't look like it's there. For example, an article with the title 'Exclusive: Alibaba plans to take Hong Kong unit private' has the most frequent hashtag 'alibaba'.

The next stones I plan to look under include: aggregate analysis of tweets to better understand the people who tweet news links and crawling conspicuous blogs to see if anything interesting happens when 'influential' bloggers link to source news material. Stay tuned.

via David R. MacIver von david am 06.02.12

A momentous occasion has just, err, occurred. I got a green build on my weighted feedback arc set solver. This has never happened before.

This might seem like a confusing concept. How can it never have passed its own test suite?

Well, the test suite has always been more aspirational than reality based. The way it effectively works is this: It has a lot of test data and it tracks the best solution found for each file, along with its score. The quality test is that the score of the solution found is within 5% of the best known solution (side-note: part of why the tests are now green is that I did raise this threshold a bit, it was originally at 1%. Most tests still hit that better threshold). A side-effect of this is that improvements to the solver will often make the tests harder to pass because they will find better solutions (and, surprisingly, this is still happening – I’ve spent enough CPU cycles and algorithmic tweaks that I keep thinking I’ve found all the optimal solutions only to be surprised by an improvement. The improvements are rarely large, but they still keep coming)

The result was that I would often make improvements to the FAS solver and then back them out because they were too slow, but leaving the test data behind, and the test would be failing as the FAS solver dreamed of the long lost days when it was cleverer.

Additionally, all the tests are capped at the fairly arbitrary maximum acceptable runtime of one minute.

At this point I should mention that finding even an approximation (I forget to within what bound – I think it’s something like half) of the optimal score for weighted FAS is an NP-hard problem, and that even calculating the score of a given instance is \(O(n^2)\). So this isn’t an easy problem to solve.

I’d largely stalled on development on it, but today I decided to go back to it and after a day of tinkering I’ve managed to make a surprising amount of progress. At the start of the day it was painfully slow on several test cases (several minutes), non-deterministic and badly failing several of its quality goals. Now it is deterministic, runs each test case in under a minute and gets within 2.5% of the best known result on all of them. For added pleasantry, the latest version has even provided the best known result for several of them, and the code is shorter and clearer than it was at the start of the day.

How does it work?

It’s pretty simple. In the end it’s ended up looking up hilariously close to the algorithm I threw together in an hour from a random paper for The Right Tool (now Hammer Principle) a few years ago.

The process is basically a mix of the following techniques:

  1. Initial scoring. This is basically a Markov chain scoring system as before. The details are slightly different, but not really in an interesting way – the idea is still to transition to better nodes and use an approximation of the stable distribution as a score
  2. What I refer to in the code as “forcing connectivity”. This is today’s clever idea, and it works really well, but it probably shouldn’t and the fact that it appears to might be a test data artifact. If you’re feeling generous it could be described as a heuristic approach, but a fairer description would perhaps be an unprincipled hack. It’s based on the observation that in sparse data sets you often end up with long chunks of ties, and these almost completely defeat local searches and make it hard to improve things. The force runs through the array one element at a time and if it’s unconnected to the next element pulls forward the next element to which it is connected (it doesn’t concern itself with ordering beyond that – the element might be strictly greater or less than it, it doesn’t care)
  3. Local sort, as before. Basically an insertion sort on the hopelessly inconsistent ordering the tournament induces. This had largely disappeared from the code – it was implicitly happening anyway, but there was no explicit local sort stage – and bringing it back in was a surprisingly big win
  4. Window optimisation. Basically, you can brute force the problem for small arrays. Window optimisation is brute forcing every subrange of some length. This is applied at various lengths in order to get mixing
  5. Single move optimisation. This is a really expensive process, but wins big enough that it’s worth it. Basically we try every possible move of a single element from one point to another and apply any that improve the score

It’s surprising how well this combination works – I’ve tried a whole bunch of more clever things, between various random search approaches, simulated annealing, block sorting (basically: build a new tournament with chunks of the existing one and recurse) and a variety of local search techniques, but they generally seemed to be significantly slower and/or less effective.

Update: The test suite is now failing again. This is partly because I’ve tightened the requirements (down to within 3% rather than 5% of best known result), partly because after some tinkering with code I’ve managed to improve the best known quality of several test cases. This is actually quite pleasant, as it shows that there are some nicely tunable parameters in here.

In a fit of agility, I added a page for mobile devices to get some of the d8taplex news info: d8taplex.com/m/hapaxPage.html. This shows the articles crawled from Reuters which are getting a reasonable amount of bit.ly juice. Note that currently you have to navigate there explicitly.

The page is extremely simple.

Mobile
Check out Jeremy Singer-Vine's page as well for top headlines. In comparison to using click data to determine 'top' articles, it uses the editorial position of articles on the news sources it crawls.

For some reason, a number of projects are coming out of the closet this week. I mentioned Reuters 'Social Pulse' briefly already (not to self: write post describing how while many Reuters journalists have twitter accounts, no-one is tweeting...). Here is another : topheadlin.es (via Nieman Journalism Lab).

The site is, reportedly, a side project by Jeremy Singer-Vine, from The Wall Street Journal. The site, designed for mobile form factor, aggregates 'top' headlines from a small number of sites. From a motivation and design perspective, I really like it - keep it simple, tell me what I need to know. However, from an analytical side, it suffers from not really understanding the content that it is aggregating.

For example, right now it looks like this:

Topheadlines

Did the Giant's win (something)?

My experimental site - d8taplex news - attempts to avoid this problem by (currently) sourcing from a single news agency, and by using some simple clustering. That being said, I should probably work on a mobile version of the page if I want to compete!

via The mloss.org community blog von Soeren Sonnenburg am 03.02.12

I did some minor maintenance work today updating email addresses of certain projects and merging libmlpack and MLPACK.

More importantly, I did add the jmlr mloss url to all mloss.org listed projects that have a corresponding jmlr publication. Since there seemed to be quite some backlog it might be worthwhile to shed some light how one gets flagged 'published in jmlr'.

Naturally, the condition is an corresponding accepted and already online jmlr mloss paper. In addition, you should remind your jmlr action editor to flag the project 'published in jmlr' by giving him the link to the abstract of your publication on the jmlr homepage. This is the field 'abs' under http://jmlr.csail.mit.edu/mloss/ . He will then add this cross-reference and your project benefits from the increased visibility.

I hope that makes the process more transparent and reduces the backlog - ohh and btw we are counting 29 JMLR-MLOSS submissions - keep it going :-)

via AI and Social Science - Brendan O'Connor von brendano am 02.02.12

When possible, I like to use R for its really, really good statistical visualization capabilities. I’m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison to base R, the matplotlib library is just painful. I wrote a toy Metropolis sampler for a triangle distribution and all I want to see is whether it looks like it’s working. For the same dataset, here are histograms with default settings. (Python: pylab.hist(d), R: hist(d))

I want to know whether my Metropolis sampler is working; those two plots give a very different idea. Of course, you could say this is an unfair comparison, since matplotlib is only using 10 bins, while R is using 18 here — and it’s always important to vary the bin size a few times when looking at histograms. But R’s defaults really are better: it actually uses an adaptive bin size, and the heuristic worked, choosing a reasonable number for the data. The hist() manual says it’s from Sturges (1926). It’s hard to find other computer software that cites 100 year old papers for its design decisions — and where it matters. (Old versions of R used to yell at you when you made a pie chart, citing perceptual studies that humans are really bad at interpreting them (here). This is what originally made me love R.)

Second, R is much smarter about breakpoints. In the following plots, I’ve manually set the number of bins to 10, and then 30 for each.

The second one is now OK for matplotlib — it’s good enough to figure out what’s going on — though still a little lame. Why the gaps?

The problem is that my data are discrete — they’re all integers from 1 through 19 — and I think matplotlib is naively carving up that range into bins, which sometimes lumps together two integers, and sometimes gets zero of them. I understand this is the simple naive implementation, and you could say it’s my fault that I shouldn’t have used the pylab histogram function for this type of data — but it’s really not as good as whatever R is doing, which works rather well here, and I didn’t have to waste time thinking about the internals of the algorithm. For reference, here is the correct visualization of the data (R: plot(table(d))). Note that R’s original Sturges breakpoints did make one error: the first two values got combined into one bin.

Lessons: (1) always vary the bin sizes for histograms, especially if you’re using naive breakpoint selection, and (2) don’t ignore a century’s worth of statistical research on these issues. And since it’s hard to learn a century’s worth of statistics, just use R, where they’re compiled it in for you.