Extracting article text from HTML documents
Category:
Overview: Extracting article text from HTML documents | My tech blog.
In the world of web scraping, text mining and article-reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for tools that can distinguish the parts of an HTML document that make up an article from other common website building blocks such as menus, headers, footers and ads.

Endeca Design Pattern Library
Category:
Endeca Design Pattern Library
The Endeca User Interface Design Pattern Library (UIDPL) describes principled ways to solve common user interface design problems related to search, faceted navigation, and discovery.

The library includes both specific UI design patterns and topics, which are groups of patterns related to significant aspects of search and discovery.

Twitter’s Plan to Analyze 100 Billion Tweets
Category:
High Scalability - Twitter’s Plan to Analyze 100 Billion Tweets
If Twitter is the “nervous system of the web”, as some people think, then what is the brain that makes sense of all those signals (tweets) from the nervous system? That brain is the Twitter Analytics System.

ElasticSearch
Category:
ElasticSearch - ElasticSearch Overview
The search engine data model has its roots in schema-free, document-oriented databases and, as the #nosql movement has shown, this model proves very effective for building applications.

The ElasticSearch data model is JSON, which is slowly emerging as the de facto standard for representing data. Moreover, JSON makes it simple to represent semi-structured data with complex entities, and it maps naturally onto programming languages, which have first-class parsers for it.
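
As a rough illustration of what "semi-structured data with complex entities" can look like, here is a minimal Python sketch that uses only the standard json module. It does not call any ElasticSearch API, and the document and all of its field names are made up for the example.

```python
import json

# Hypothetical blog-post document; every field name here is illustrative.
doc = {
    "title": "Extracting article text from HTML documents",
    "tags": ["scraping", "text-mining"],
    # Nested object: a complex entity embedded directly in the document.
    "author": {"name": "Jane Doe", "location": {"city": "Berlin"}},
    # List of objects; the second one carries an extra field, which a
    # schema-free model accepts without any prior declaration.
    "comments": [
        {"user": "alice", "text": "Nice write-up"},
        {"user": "bob", "text": "Thanks", "likes": 3},
    ],
}

payload = json.dumps(doc)        # serialize to the JSON text a client would send
restored = json.loads(payload)   # parse back into native dicts and lists

print(restored["author"]["location"]["city"])
```

The round trip through json.dumps and json.loads returns ordinary dicts and lists, which is the "programming language natural" property the excerpt refers to.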

Auto-Suggest From Popular Queries Using EdgeNGrams
Category:
Auto-Suggest From Popular Queries Using EdgeNGrams
A popular feature of most modern search applications is the auto-suggest or auto-complete feature where, as a user types their query into a text box, suggestions of popular queries are presented. As each additional character is typed in by the user the list of suggestions is refined. There are several different approaches in Solr to provide this functionality, but we will be looking at an approach that involves using EdgeNGrams as part of the analysis chain.
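
The idea behind edge n-grams can be sketched in plain Python, without Solr: index every leading prefix of each popular query, then look up whatever the user has typed so far. This is only an illustration of the technique, not the configuration from the linked article; the function names, the min_gram and max_gram parameters, and the sample query counts are all assumptions made for the example.

```python
from collections import defaultdict


def edge_ngrams(text, min_gram=1, max_gram=20):
    """Return the leading prefixes ("edge n-grams") of text, lowercased."""
    text = text.lower()
    upper = min(len(text), max_gram)
    return [text[:n] for n in range(min_gram, upper + 1)]


def build_index(queries_with_counts):
    """Map every edge n-gram to the (count, query) pairs it came from."""
    index = defaultdict(list)
    for query, count in queries_with_counts.items():
        for gram in edge_ngrams(query):
            index[gram].append((count, query))
    return index


def suggest(index, prefix, limit=5):
    """Return up to limit queries matching prefix, most popular first."""
    matches = sorted(index.get(prefix.lower(), []), key=lambda m: (-m[0], m[1]))
    return [query for _, query in matches[:limit]]


# Made-up popularity counts, purely for illustration.
popular = {"solr": 120, "solr faceting": 45, "spell checking": 30, "sorting": 80}
index = build_index(popular)

print(suggest(index, "so"))  # ['solr', 'sorting', 'solr faceting']
```

Each extra character the user types simply selects a longer prefix key, which is why the suggestion list narrows as typing continues.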

Lucene Image REtrieval
Category:
freshmeat.net: Project details for Lucene Image REtrieval
The LIRE (Lucene Image REtrieval) library provides a simple way to create a Lucene index of image features for content-based image retrieval (CBIR), which allows searching for similar images. The features used are taken from the MPEG-7 standard: ScalableColor, ColorLayout, and EdgeHistogram. Methods for searching the index are also provided.

SolrJS
Category:
SolrJS
A jQuery-based Ajax interface to the Solr search engine.

What's new with Apache Solr
Category:
What's new with Apache Solr
Apache Solr has added many new features and performance improvements since the Search smarter with Apache Solr series was published. In this article, Solr and Lucene committer Grant Ingersoll details the improvements in Solr 1.3, including distributed search, easy database imports, integrated spell checking, new extension APIs, and much more.

Pagerank Explained
Category:
Pagerank Explained.
PageRank is Google's way of deciding a page's importance. It matters because it is one of the factors that determines a page's ranking in the search results. It isn't the only factor that Google uses to rank pages, but it is an important one.
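
For a feel of how such a score can be computed, here is a minimal Python sketch of the textbook power-iteration form of PageRank. It is not necessarily the formulation used in the linked article, and certainly not Google's production system. The damping value 0.85 is the commonly cited default, the four-page graph is invented, and the function assumes every link target also appears as a key in the input.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every link
    target must also be a key (an assumption of this sketch).
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page receives a baseline share from the "random jump".
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Invented four-page web: C is linked to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, C ends up with the highest score because it receives the most inbound links, while D, which nothing links to, keeps only the baseline share.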

This article explains in detail how PageRank is calculated.

Searchme Visual Search
Category:
Searchme Visual Search
An entertaining new search engine. Time will tell about the rest...