Processing Tweets with LingPipe


Processing Tweets with LingPipe #3: Near duplicate detection and evaluation « LingPipe Blog

The duplicate detection problem in Twitter is really about word overlap, with a slight game-of-telephone quality to it. Elaborations of earlier tweets tend to add leading or trailing comments while preserving the core of the source tweet. Little rephrasing goes on, so word overlap between tweets is the obvious place to start. That entails a few additions to our approach:

1. Find words in the tweets: Tokenization
2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance
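
To make the two steps concrete, here is a minimal sketch in plain Java. It is not LingPipe's tokenizer or its Jaccard implementation: the class name `JaccardSketch`, the lowercase-and-split-on-whitespace tokenizer, and the example tweets are all assumptions made for illustration. It computes Jaccard distance as one minus the size of the token intersection divided by the size of the token union, which is why a short tweet and a longer elaboration of it can still come out close.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only: not LingPipe's tokenizer or its Jaccard implementation.
class JaccardSketch {

    // Hypothetical tokenizer (an assumption): lowercase, then split on whitespace runs.
    static Set<String> tokenize(String tweet) {
        Set<String> tokens = new HashSet<>();
        for (String token : tweet.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Jaccard distance = 1 - |A intersect B| / |A union B|, over sets of tokens.
    static double distance(String a, String b) {
        Set<String> tokensA = tokenize(a);
        Set<String> tokensB = tokenize(b);

        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        if (union.isEmpty()) {
            // Two empty tweets are treated as identical; this is a choice made here.
            return 0.0;
        }

        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);

        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Made-up tweets: the second adds a leading "RT" to the first.
        String source = "lingpipe makes text processing fun";
        String retweet = "RT lingpipe makes text processing fun";
        System.out.println(distance(source, retweet));
    }
}
```

With the made-up tweets in `main`, five of the six distinct tokens are shared, so the distance is 1 - 5/6, or roughly 0.17. Because the measure works on sets of tokens, repeated words and tweet length do not affect it directly; only the proportion of shared vocabulary does.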
