Processing Tweets with LingPipe

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation « LingPipe Blog

The duplicate detection problem in Twitter is really about word overlap, with a slight game-of-telephone quality to it. Elaborations of previous tweets tend to be leading or trailing comments, with the core of the source tweet preserved. Not much rephrasing goes on, so word overlap between tweets is the obvious place to start. That entails a few additions to our approach:

1. Find words in the tweets: Tokenization
2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance
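
The two steps above can be sketched in a few lines of plain Java. This is a minimal illustration, not the code from this series: it assumes a naive lowercase-and-split-on-whitespace tokenizer, and the class and method names (`JaccardSketch`, `tokenize`, `jaccardDistance`) are hypothetical. It deliberately uses only the standard library rather than LingPipe's own tokenizer and distance classes, so the arithmetic is visible.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: names here are hypothetical, not LingPipe's API.
class JaccardSketch {

    // Step 1: tokenization. Assumption: lowercase the text and split on
    // runs of whitespace; a real tokenizer would handle punctuation too.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String tok : text.toLowerCase().trim().split("\\s+")) {
            if (!tok.isEmpty()) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    // Step 2: Jaccard distance = 1 - |A intersect B| / |A union B|.
    // Dividing by the union size is what removes sensitivity to tweet length.
    static double jaccardDistance(String a, String b) {
        Set<String> tokensA = tokenize(a);
        Set<String> tokensB = tokenize(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 0.0; // treat two empty tweets as identical
        }
        Set<String> intersection = new HashSet<>(tokensA);
        intersection.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return 1.0 - (double) intersection.size() / union.size();
    }
}
```

For example, `JaccardSketch.jaccardDistance("RT @user the quick brown fox", "the quick brown fox")` shares 4 of 6 distinct tokens, giving a distance of 1 - 4/6, roughly 0.33. Identical tweets score 0.0 and tweets with no words in common score 1.0, so a near-duplicate detector would flag pairs below some chosen threshold.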
