Creating a custom analyzer

Most analysis customization happens in the createComponents method, where the Tokenizer and TokenFilters are defined.

CharFilters, which preprocess the raw character stream before it reaches the Tokenizer, can be added by overriding the initReader method.

Analyzer analyzer = new Analyzer() {
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        //Strip HTML markup from the raw text before it reaches the tokenizer
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer tokenizer = new StandardTokenizer();
        TokenStream stream = new StandardFilter(tokenizer);
        //Order matters!  If LowerCaseFilter and StopFilter were swapped here, StopFilter's
        //matching would be case sensitive, so "the" would be eliminated, but not "The"
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(tokenizer, stream);
    }
};
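
The effect of filter order can be reproduced without Lucene. The following is a minimal plain-Java sketch, not Lucene code: FilterOrderDemo, lowerCase, removeStopWords and the three-word stop list are hypothetical names invented for illustration, standing in for LowerCaseFilter and StopFilter with a lowercase-only stop set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java analogy of a two-stage filter chain. None of these names are Lucene APIs.
class FilterOrderDemo {
    // Hypothetical stop list; like a typical English stop set, it holds lowercase words only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Stand-in for a lowercasing filter: returns a lowercased copy of every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for a stop filter: drops tokens that match the stop list exactly (case sensitive).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "Name", "of", "the", "Rose");
        // Lowercase first, then remove stop words: "The" is normalized and removed.
        System.out.println(removeStopWords(lowerCase(tokens))); // prints [name, rose]
        // Remove stop words first, then lowercase: "The" slips past the case-sensitive match.
        System.out.println(lowerCase(removeStopWords(tokens))); // prints [the, name, rose]
    }
}
```

In the second ordering the capitalized "The" survives the stop-word step and is only lowercased afterwards, which is the situation the comment in the analyzer above warns about.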

Iterating manually through analyzed tokens

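TokenStream stream = myAnalyzer.tokenStream("myField", textToAnalyze);
CharTermAttribute token = stream.addAttribute(CharTermAttribute.class);
stream.reset();  //Required before the first call to incrementToken()
while (stream.incrementToken()) {
    String term = token.toString();  //The attribute is updated in place for each token
}
stream.end();
stream.close();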

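Several Attributes are available, such as OffsetAttribute and PositionIncrementAttribute. The most common is CharTermAttribute, which holds the text of the current token; call its toString() method to get the analyzed term as a String.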