"machine learning" via mikiobraun
I was thinking about the problem of authorship attribution... Have people thought about the flip side of this problem? Namely, "anonymizing" text so that it would be hard to attribute it to any author?This is something I've actually wondered about in the context of blogging for a while. I noticed at some point that my "blogger voice" is very similar to my "reviewer voice" and started worrying that I might be too identifiable as a reviewer. This might either be due to lexical choice ("bajillion" or "awesome") or due to some more subtle stylistic choices.
There is quite a bit of work on authorship attribution. I think the first time I heard a talk on this topic was on March 24, 2004, when Shlomo Argamon gave a talk at ISI (no, I don't have an amazing memory, I cheated) on "On Writing, Our Selves: Explorations in Stylistic Text Categorization." The basic hypothesis of the talk, at least as I remember it, was that if you're trying to do authorship attribution, you should throw out content words and focus on things like POS tag sequences, parse tree structures, and things like that.
There's been a lot of subsequent work in this, and related areas. One very related area is on things like trying to predict demographic information (age, gender, socio-economic status, education level, and, yes, astrological sign) from tweets, blog posts or emails (or other forms). One of the key distinctions that I think is important in all of this work is whether the original author is intentionally trying to hide information about him or herself. For instance, someone trying to impersonate Shakespeare, or a child predator pretending to be a different age or gender, or a job applicant trying to sound more educate than is true. This latter is a much harder problem because the stupid topically stereotypical features that pop out as being indicative (like men talking about "wifes" and "football" and women talking about "husbands" and "yoga") and the silly features that don't really tell us anything interesting (on twitter, apparently men tend to put "http://" before URLs more than women -- who knew?) because these "pretenders" are going to intentionally try to hide that information (now that everyone knows to hide "http://" to trick gender recognizers!). It also means that falling back on topic as a surrogate for demography should not work as well. This seems to be a very different problem from trying to identify whether a blog post is written by me or by Jon, which should be 99.9% do-able by just looking at content words.
The reason I bring this all up is because we don't want to anonymize by changing the topic. The topic needs to stay the same: we just need to cut out additional identifying information. So, getting back to Jon's question, the most relevant work that I know of is on text steganography (by Ching-Yun Chang and Stephen Clark), where they use the ability to do paraphrasing to encode messages in text. Aside from the challenge of making the output actually somewhat grammatical, the basic idea is that when you have two ways of saying the same thing (via paraphases), you can choose the first one to encode a "0" and the second to encode a "1" and then use this to encode a message in seemingly-natural text.
I also remember having a conversation a while ago while a (different) colleague about trying to build a chat system where you could pretend that you're chatting with someone famous (like Obama or Harry Potter or Scooby Doo). A similar problem is trying to paraphrase my own writing to sound like someone else, but zoinks, that seems hard! A basic approach would be to build a Scooby Doo language model (SDLM) and then run my blog posts through a paraphrase engine that uses the SDLM for producing the output. My vague sense is that this would work pretty poorly, primarily because the subtleness in phrase structure selection would be lost on a highly-lexicalized language model. I imagine you'd get some funny stuff out and it might be amusing to do, but I don't have time to try.
As far as pure anonymization goes, it seems like doing something similar to the steganography approach would work. Here, what you could do is generate a random sequence of bits, and then "encode" that random sequence using the steganography system. This would at least remove some identifying information. But the goal of the steganography isn't to change every phrase, but just to change enough phrases that you can encode your message. It also wouldn't solve the problem that perhaps you can identifying a bit about an author by the lengths of their sentences. Or their oscillation between long and short sentences. This also wouldn't be hidden.
An alternative, human-in-the-loop approach might be simply to have an authorship recognition system running in your word processor, and then any time you type something that enables it to identify you, it could highlight it and you could be tasked with changing it. I suspect this would be a frustrating, but fairly interesting experience (at least the first time).
p.s., I'm now officially tweeting on @haldaume3.
I’ve released a new version of my software for Low Density Parity Check (LDPC) codes. These error-correcting codes were invented by Robert Gallager in the early 1960′s, and re-invented and shown to have very good performance by David MacKay and myself in the mid-1990′s. The decoding algorithm for LDPC codes is related to that used for Turbo codes, and to probabilistic inference methods used in other fields. Variations on LDPC and Turbo codes are currently the best practical codes known, in terms of their ability to transmit data at rates approaching channel capacity with very low error probability.
This new version has only a few bug fixes and minor enhancements. The big change is that I’ve put up the source code as a Github repository, that uses the git source code control system. This should make it easier for other people to create their own extensions of the software. The software is available here, and the Github repository is here.
For me, this is also an exercise in learning about git and Github. From my initial experience, git does seem better than the source code control systems I’ve used previously (which are Subversion, CVS, and the “modify” utility of the Kronos operating system for the CDC 6000 series of computers). The help pages are a bit cryptic in places, though, at least for the novice user…
The current version of the d8taplex news site provides a mapping between news article authors and their Twitter accounts. In addition, I've been experimenting with mining all the tweets that link to a specific news article. This has lead to a couple of observations.
- Reuters contributors with Twitter accounts don't tweet very much. I can quantify this in a later post, but if you look on the d8taplex site at any time, you will rarely see the icon that indicates a journalist has recently tweeted (
). - Twitter users who post links to news articles rarely contribute any useful hashtags. The version of the site that I have running locally accumulates all the hashtags associated with an article and they invariably tend to simply be the main nouns in the article title. I was hoping for some sort of magic here but it doesn't look like it's there. For example, an article with the title 'Exclusive: Alibaba plans to take Hong Kong unit private' has the most frequent hashtag 'alibaba'.
The next stones I plan to look under include: aggregate analysis of tweets to better understand the people who tweet news links and crawling conspicuous blogs to see if anything interesting happens when 'influential' bloggers link to source news material. Stay tuned.
Sequence optimization, where the items in a list are ordered to maximize some reward has many applications such as web advertisement placement, search, and control libraries in robotics. Previous work in sequence optimization produces a static ordering that does not take any features of the item or context of the problem into account. In this work, we propose a general approach to order the items within the sequence based on the context (e.g., perceptual information, environment description, and goals). We take a simple, efficient, reduction-based approach where the choice and order of the items is established by repeatedly learning simple classifiers or regressors for each "slot" in the sequence. Our approach leverages recent work on submodular function maximization to provide a formal regret reduction from submodular sequence optimization to simple cost-sensitive prediction. We apply our contextual sequence prediction algorithm to optimize control libraries and demonstrate results on two robotics problems: manipulator trajectory prediction and mobile robot path planning.
We describe Information Forests, an approach to classification that generalizes Random Forests by replacing the splitting criterion of non-leaf nodes from a discriminative one -- based on the entropy of the label distribution -- to a generative one -- based on maximizing the information divergence between the class-conditional distributions in the resulting partitions. The basic idea consists of deferring classification until a measure of "classification confidence" is sufficiently high, and instead breaking down the data so as to maximize this measure. In an alternative interpretation, Information Forests attempt to partition the data into subsets that are "as informative as possible" for the purpose of the task, which is to classify the data. Classification confidence, or informative content of the subsets, is quantified by the Information Divergence. Our approach relates to active learning, semi-supervised learning, mixed generative/discriminative learning.
Suppose that we observe noisy linear measurements of an unknown signal that can be modeled as the sum of two component signals, each of which arises from a nonlinear sub-manifold of a high dimensional ambient space. We introduce SPIN, a first order projected gradient method to recover the signal components. Despite the nonconvex nature of the recovery problem and the possibility of underdetermined measurements, SPIN provably recovers the signal components, provided that the signal manifolds are incoherent and that the measurement operator satisfies a certain restricted isometry property. SPIN significantly extends the scope of current recovery models and algorithms for low dimensional linear inverse problems and matches (or exceeds) the current state of the art in terms of performance.
We consider the problem of finding the graph on which an epidemic cascade spreads, given only the times when each node gets infected. While this is a problem of importance in several contexts -- offline and online social networks, e-commerce, epidemiology, vulnerabilities in infrastructure networks -- there has been very little work, analytical or empirical, on finding the graph. Clearly, it is impossible to do so from just one cascade; our interest is in learning the graph from a small number of cascades.
For the classic and popular "independent cascade" SIR epidemics, we analytically establish the number of cascades required by both the global maximum-likelihood (ML) estimator, and a natural greedy algorithm. Both results are based on a key observation: the global graph learning problem decouples into $n$ local problems -- one for each node. For a node of degree $d$, we show that its neighborhood can be reliably found once it has been infected $O(d^2 \log n)$ times (for ML on general graphs) or $O(d\log n)$ times (for greedy on trees). We also provide a corresponding information-theoretic lower bound of $\Omega(d\log n)$; thus our bounds are essentially tight. Furthermore, if we are given side-information in the form of a super-graph of the actual graph (as is often the case), then the number of cascade samples required -- in all cases -- becomes independent of the network size $n$.
Finally, we show that for a very general SIR epidemic cascade model, the Markov graph of infection times is obtained via the moralization of the network graph.
We propose a new yet natural algorithm for learning the graph structure of general discrete graphical models (a.k.a. Markov random fields) from samples. Our algorithm finds the neighborhood of a node by sequentially adding nodes that produce the largest reduction in empirical conditional entropy; it is greedy in the sense that the choice of addition is based only on the reduction achieved at that iteration. Its sequential nature gives it a lower computational complexity as compared to other existing comparison-based techniques, all of which involve exhaustive searches over every node set of a certain size. Our main result characterizes the sample complexity of this procedure, as a function of node degrees, graph size and girth in factor-graph representation. We subsequently specialize this result to the case of Ising models, where we provide a simple transparent characterization of sample complexity as a function of model and graph parameters.
For tree graphs, our algorithm is the same as the classical Chow-Liu algorithm, and in that sense can be considered the extension of the same to graphs with cycles.
