Appendix A - Index RecordContainer Properties

Index field types

The exact Solr configuration of the individual index field types can change and differ slightly between ADITO versions.

If you are not sure which field type fits an EntityField, the Solr Admin UI offers a way to test how values are analyzed by individual field types. The names of the field types in the schema mostly match the ADITO field type name in lowercase; TEXT and TEXT_NO_STOPWORDS additionally start with 'adito_'. The exact schema name is listed as "Solr field type" in each section below.

Path example: http://localhost:8983/solr/#/test_solr9/analysis

Path pattern: http://<solr-host-and-port>/solr/#/<collection>/analysis
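The same kind of analysis can also be requested outside the Admin UI. The following minimal sketch only builds a request URL and sends nothing. It assumes that Solr's field analysis request handler is reachable under /analysis/field in your installation; the helper name field_analysis_url is hypothetical.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url: str, collection: str,
                       field_type: str, value: str) -> str:
    """Build a URL for Solr's field analysis request handler."""
    params = urlencode({
        "analysis.fieldtype": field_type,   # Solr field type, e.g. "address"
        "analysis.fieldvalue": value,       # value to run through the index analyzer
        "wt": "json",
    })
    # Assumption: the handler is available under /analysis/field.
    return f"{base_url.rstrip('/')}/solr/{collection}/analysis/field?{params}"


url = field_analysis_url("http://localhost:8983", "test_solr9",
                         "address", "Konrad-Zuse-Straße 4")
print(url)
```

The returned URL can be opened in a browser or passed to any HTTP client; the response lists the tokens after each analysis step.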

The following section describes all currently available index field types, as they can be set in the indexFieldType property of the RecordFieldMappings of an IndexRecordContainer.

ADDRESS

Description:
Field type for address data such as country, city, postal code, or street names. This type normalizes input (e.g., by converting umlauts and special characters) and additionally generates phonetic tokens using RefinedSoundex. Synonyms are planned, but currently not active. The goal is robust match accuracy for different spellings and abbreviations, such as country codes.

Example:
Input:
Konrad-Zuse-Straße 4 DE - 84144 Geisenhausen
Tokens before phonetics:
"KonradZuseStrasse", "Konrad-Zuse-Strasse", "Konrad", "Zuse", "Strasse", "4", "DE", "-", "84144", "Geisenhausen"
Tokens with phonetics:
"k3089065030369030", "konradzusestrasse", "k308906", "konrad", "z5030", "zuse", "s369030", "strasse", "4", "d60", "de", "84144", "g403080308", "geisenhausen"

Solr field type: address
Content types: TEXT

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | WhitespaceTokenizer | |
| Lowercasing | yes | |
| Stopwords | no | |
| ASCII-Folding | yes | replaces e.g. umlauts |
| Normalization | yes | GermanNormalization |
| Word Delimiter | yes | fully active |
| Phonetic Analysis | yes | RefinedSoundex |
| Synonyms | planned | not yet active |
| Leading Wildcards Support | no | |

Solr configuration:

<!-- fieldType for Addresses -->
<fieldType name="address" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="lang/adito/address_synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="lang/adito/address_synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.PhoneticFilterFactory" encoder="RefinedSoundex"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

BOOLEAN

Description:
Primitive field type for boolean values. Accepts true or false. Values starting with 1, t, or T are interpreted as true, all others as false. No analysis or tokenization.

Example:
Input: true
Stored value: true
Input: 0
Stored value: false
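The interpretation rule described above can be sketched in a few lines. This is an illustration of the documented behavior only, not Solr's implementation; the function name parse_solr_bool is hypothetical.

```python
def parse_solr_bool(value: str) -> bool:
    """Return True if the value starts with '1', 't' or 'T', otherwise False."""
    # value[:1] is the empty string for empty input, which yields False.
    return value[:1] in ("1", "t", "T")


for raw in ("true", "T", "1", "0", "false", "yes"):
    print(raw, "->", parse_solr_bool(raw))
```

Note that values such as "yes" or "on" end up as false under this rule, so they should be converted before indexing.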

Solr field type: boolean, booleans
Content types: BOOLEAN

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | BoolField | Primitive field |

Solr configuration:

<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
<fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>

COMMUNICATION

Description:
General field type for communication data such as phone numbers, email addresses, and URLs. The input is treated as a single token and then split into components (e.g., domain, TLD, local parts) via the WordDelimiter filter. Includes ASCII-Folding and lowercasing. For pure phone numbers, the TELEPHONE field type is recommended.

Example:
Email input:
info@adito-software.de
Tokens: "info@adito-software.de", "info", "adito", "software", "de", "infoaditosoftwarede"

Phone number input:
+49 (8743) 9664-0
Tokens: "+49 (8743) 9664-0", "49", "8743", "9664", "0", "49874396640"

URL input:
https://www.adito.de/unternehmen/philosophie.html
Tokens: "https://www.adito.de/unternehmen/philosophie.html", "https", "www", "adito", "de", "unternehmen", "philosophie", "html", "httpswwwaditodeunternehmenphilosophiehtml"

Solr field type: communication_address
Content types: TEXT, EMAIL, TELEPHONE, LINK

Properties:

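| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | KeywordTokenizer | |
| Lowercasing | yes | |
| Stopwords | no | |
| ASCII-Folding | yes | |
| Normalization | no | |
| Word Delimiter | yes | full functionality |
| Phonetic Analysis | no | |
| Synonyms | no | |
| Leading Wildcards Support | yes | |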

Solr configuration:

<!-- general fieldType for email, urls and phone-numbers -->
<fieldType name="communication_address" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

DATE

Description:
Primitive field type for date values. Supports UTC timestamps with millisecond precision in ISO format (yyyy-MM-dd'T'HH:mm:ss[.SSS]Z). Values are stored as a point field for fast range queries.
See also: Solr Working with Dates

Example:
Input: 2022-03-15T14:22:00Z
Stored value: 2022-03-15T14:22:00Z
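If values are prepared outside ADITO, they have to be converted to UTC and formatted as shown above. A minimal sketch, assuming timezone-aware input; the function name to_solr_date is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_solr_date(dt: datetime) -> str:
    """Format an aware datetime as a UTC timestamp such as 2022-03-15T14:22:00Z."""
    utc = dt.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    if utc.microsecond:
        # Append milliseconds only when a fractional part is present.
        text += f".{utc.microsecond // 1000:03d}"
    return text + "Z"


# 15:22 at UTC+1 corresponds to 14:22 UTC.
local = datetime(2022, 3, 15, 15, 22, tzinfo=timezone(timedelta(hours=1)))
print(to_solr_date(local))
```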

Solr field type: pdate, pdates
Content types: DATE

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | DatePointField | Primitive field |

Solr configuration:

<!-- KD-tree versions of date fields -->
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<fieldType name="pdates" class="solr.DatePointField" docValues="true" multiValued="true"/>

DOUBLE

Description:
Primitive field type for 64-bit floating point numbers (double). Stored as a point field for efficient range and value queries.

Example:
Input: 3.14159
Stored value: 3.14159

Solr field type: pdouble, pdoubles
Content types: NUMBER

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | DoublePointField | Primitive field |

Solr configuration:

<fieldType name="pdouble" class="solr.DoublePointField" docValues="true"/>
<fieldType name="pdoubles" class="solr.DoublePointField" docValues="true" multiValued="true"/>

EMAIL

Description:
Field type specifically for email addresses. Non-ASCII characters are normalized, and special characters generate additional tokens. The WordDelimiter filter splits on special characters but not on case changes (CamelCase).

Example:
Input: info@adito-online.de
Tokens: "info@adito-online.de", "info", "adito", "online", "de", "infoadito", "aditoonline", "onlinede", "infoaditoonlinede"

Solr field type: email_address
Content types: TEXT, EMAIL

Properties:

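| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | WhitespaceTokenizer | |
| Lowercasing | yes | |
| Stopwords | no | |
| ASCII-Folding | yes | |
| Normalization | no | |
| Word Delimiter | yes | except CamelCase |
| Phonetic Analysis | no | |
| Synonyms | no | |
| Leading Wildcards Support | yes | |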

Solr configuration:

<fieldType name="email_address" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
  </analyzer>
</fieldType>

HTML

Description:
Field type for HTML content. Internally treated like TEXT_NO_STOPWORDS, i.e., no stopword filtering and no special HTML analysis.

Example:
Input: <p>Willkommen bei ADITO!</p>
Tokens: "willkommen", "bei", "adito"

Solr field type: adito_text_nostopwords
Content types: TEXT, HTML

Properties:

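| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | StandardTokenizer | |
| Lowercasing | yes | |
| Stopwords | no | |
| ASCII-Folding | yes | |
| Normalization | yes | German |
| Word Delimiter | no | |
| Phonetic Analysis | no | |
| Synonyms | yes | Solr default |
| Leading Wildcards Support | no | |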

Solr configuration:
see section TEXT_NO_STOPWORDS

INTEGER

Description:
Primitive field type for 32-bit signed integers (int). Stored as a point field for efficient range queries.

Example:
Input: 42
Stored value: 42

Solr field type: pint, pints
Content types: NUMBER

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | IntPointField | Primitive field |

Solr configuration:

<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pints" class="solr.IntPointField" docValues="true" multiValued="true"/>

LOCATION

Description:
Primitive field type for geographic coordinates (latitude/longitude pairs, format: lat,lon). Supports spatial search and distance calculation. Stored as a point field.
See also: Solr Spatial Search

Example:
Input: 48.123456,11.654321
Stored value: 48.123456,11.654321

Solr field type: location
Content types: TEXT

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | LatLonPointSpatialField | Primitive field |

Solr configuration:

<!-- A specialized field for geospatial search filters and distance sorting. -->
<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>

LONG

Description:
Primitive field type for 64-bit signed integers (long). Stored as a point field for efficient range queries.

Example:
Input: 12345678901234
Stored value: 12345678901234

Solr field type: plong, plongs
Content types: NUMBER, FILESIZE, DATE

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | LongPointField | Primitive field |

Solr configuration:

<fieldType name="plong" class="solr.LongPointField" docValues="true"/>
<fieldType name="plongs" class="solr.LongPointField" docValues="true" multiValued="true"/>

LONG_TEXT

Description:
Field type for large text content such as PDFs with many pages or entire books.

This field type behaves like the TEXT type with three exceptions:

  1. Stopwords are already filtered during indexing.

  2. Words that were hyphenated across a line break are joined together again.

  3. The large property prevents the contents from being loaded into the (Solr) cache.

IN: Eine neue Infografik über unser Logo soll …
OUT: "neue"(2), "infografik"(3), "unser"(5), "logo"(6), "soll"(7), …

Solr field type: adito_text_large

Content types: TEXT, FILE, HTML

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | WhitespaceTokenizer | |
| Lowercasing | YES | |
| Stopwords | YES | |
| ASCII-Folding | YES | |
| Normalization | YES | |
| Word Delimiter | YES | Only special characters |
| Phonetic | NO | |
| Synonyms | YES | Solr default → Internally empty |
| Leading Wildcards Support | NO | |

The large attribute prevents the contents of the field from being loaded into the Solr cache. This increases search performance, as large texts, such as the entire content of a PDF, do not "clog" the cache.
However, fields with this attribute cannot be multiValued!

Solr Schema

<fieldType name="adito_text_large" class="solr.TextField" positionIncrementGap="100" multiValued="true" large="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.HyphenatedWordsFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

PHONETIC_NAME

Description:
Field type for phonetic content such as personal names. This type uses a phonetic filter (BeiderMorseFilter) to analyze the content. This enables matches for terms or names that sound similar, e.g., "Meier", "Maier", and "Mayer".

IN: Tim Meier
OUT: "tim"(1), "tn"(1), "mDr"(2)

IN: Tim Maier
OUT: "tim"(1), "tn"(1), "mDr"(2)

Solr field type: phonetic_name

Content types: TEXT

Properties:

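| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | StandardTokenizer | |
| Lowercasing | YES | |
| Stopwords | NO | |
| ASCII-Folding | NO | |
| Normalization | NO | |
| Word Delimiter | YES | CamelCase |
| Phonetic | YES | BeiderMorse |
| Synonyms | YES | Currently empty |
| Leading Wildcards Support | YES | |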

Solr Schema

<fieldType name="phonetic_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/adito/pers_name_synonyms.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="lang/adito/pers_name_synonyms.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
  </analyzer>
</fieldType>

PROPER_NAME

Description:
Field type for proper names such as company names. The content is first normalized (umlauts & non-ASCII characters) and then analyzed with a simple phonetic filter (DoubleMetaphoneFilter).

IN: quick-mix
OUT: "quickmix"(1), "KKMK"(1), "quick-mix"(1), "quick"(1), "KK"(1), "mix"(1), "MKS"(1)

Solr field type: proper_name

Content types: TEXT

Properties:

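| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | WhitespaceTokenizer | |
| Lowercasing | YES | |
| Stopwords | NO | |
| ASCII-Folding | YES | |
| Normalization | YES | |
| Word Delimiter | YES | FULL |
| Phonetic | YES | DoubleMetaphone |
| Synonyms | NO | |
| Leading Wildcards Support | YES | |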

Solr Schema

<fieldType name="proper_name" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory"/>
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DoubleMetaphoneFilterFactory"/>
  </analyzer>
</fieldType>

STRING

Description:
Primitive field type for short strings (UTF-8). No analysis or tokenization – the value is stored as-is. Suitable for fields up to ~32 KB.

Example:
Input: ADITO123
Stored value: ADITO123

Solr field type: string, strings
Content types: TEXT, BOOLEAN, DATE

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | StrField | no analysis, stored as given |

Solr configuration:

<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true"/>

TELEPHONE

Description:
Field type optimized for phone numbers.

This type can handle area codes and + signs. The digits of the number are concatenated (e.g.: +49 871 123456 → 0049871123456) and then additional sub-numbers (n-grams) are generated.

IN: +49 (8743) 9664-0
OUT: "0049874396640"(1), "004987439664"(1), "00498743966"(1), "0049874396"(1), … "049874396640"(1), "49874396640"(1), "9874396640"(1), "874396640"(1), … "004"(1), "049"(1), "498"(1) … "664"(1), "640"(1)
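The normalization and n-gram steps described above can be approximated as follows. This is a rough sketch under simplifying assumptions: it only reproduces the fully concatenated number and its n-grams, it ignores the separate number parts that the WordDelimiter filter also emits, and the function names are hypothetical.

```python
import re


def normalize_phone(raw: str) -> str:
    """Approximate the concatenated number produced for a phone value."""
    s = raw.lower()
    s = re.sub(r"^[+]", "00", s, count=1)        # leading "+" becomes "00"
    s = re.sub(r"^0([^0])", r"\1", s, count=1)   # drop a single leading "0"
    return "".join(re.findall(r"\d", s))         # keep and join the digits


def ngrams(token: str, min_size: int = 3, max_size: int = 30) -> list[str]:
    """All substrings whose length lies between min_size and max_size."""
    upper = min(max_size, len(token))
    return [token[i:i + n]
            for n in range(min_size, upper + 1)
            for i in range(len(token) - n + 1)]


number = normalize_phone("+49 871 123456")
print(number)
print(ngrams("12345"))
```

Because every substring of at least three digits is indexed, a search for any part of the number (for example only the extension) can match, at the cost of a larger index.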

Solr field type: phone_number

Content types: TEXT, TELEPHONE

Properties:

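| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | KeywordTokenizer | |
| Lowercasing | YES | |
| Stopwords | NO | |
| ASCII-Folding | NO | |
| Normalization | NO | |
| Word Delimiter | YES | Only for numbers |
| Phonetic | NO | |
| Synonyms | NO | |
| Leading Wildcards Support | NO | |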

Solr Schema

<fieldType name="phone_number" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[+]" replacement="00" replace="first"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^0([^0])" replacement="$1" replace="first"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s" replacement="-"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="0"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^[+]" replacement="00" replace="first"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^0([^0])" replacement="$1" replace="first"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s" replacement="-"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="0"/>
  </analyzer>
</fieldType>

TEXT

Description:
Type for standard text.

The content is normalized (umlauts & non-ASCII characters). Terms with special characters and CamelCase are additionally split.

During search, stopwords are filtered out. However, if the search pattern consists only of stopwords (e.g., 'AT', which is also a country code), the stopword filter is skipped.

Example: Indexing

IN: Eine neue Infografik über unser Logo soll …
OUT: "eine"(1), "neue"(2), "infografik"(3), "uber"(4), "unser"(5), "logo"(6), "soll"(7), …

Example: Searching

IN: Eine neue Infografik über unser Logo soll …
OUT: "neue"(2), "infografik"(3), "logo"(6), "soll"(7), …

Example: Searching only stopwords

IN: Sein oder nicht sein
OUT: "sein"(1), "oder"(2), "nicht"(3), "sein"(4)

Solr field type: adito_text

Content types: TEXT, FILE, HTML

Properties:

| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | StandardTokenizer | |
| Lowercasing | YES | |
| Stopwords | YES | |
| ASCII-Folding | YES | |
| Normalization | YES | German |
| Word Delimiter | YES | only word splitting |
| Phonetic | NO | |
| Synonyms | YES | Solr default → Internally empty |
| Leading Wildcards Support | NO | |

Solr Schema

<!-- Default ADITO text field used by dynamic schema -->
<fieldType name="adito_text" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

TEXT_NO_STOPWORDS

Description:
Standard text field (TEXT) without stopword filtering.

Example

IN: Eine neue Infografik über unser Logo soll …
OUT: "eine"(1), "neue"(2), "infografik"(3), "uber"(4), "unser"(5), "logo"(6), "soll"(7), …

Solr field type: adito_text_nostopwords

Content types: TEXT, FILE, HTML

Properties:

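| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | StandardTokenizer | |
| Lowercasing | YES | |
| Stopwords | NO | |
| ASCII-Folding | YES | |
| Normalization | YES | German |
| Word Delimiter | NO | |
| Phonetic | NO | |
| Synonyms | YES | Solr default → Internally empty |
| Leading Wildcards Support | NO | |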

Solr Schema

<fieldType name="adito_text_nostopwords" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

TEXT_PLAIN

Description:
Field type for texts whose content should not be analyzed.
This type only eliminates punctuation and transforms the text to lowercase.

This field type treats 'ä', 'ö', 'ü', and 'ß' as distinct characters.

Example

IN: Neue ADITO Schreibblöcke!
OUT: "neue"(1), "adito"(2), "schreibblöcke"(3)

Solr field type: text_plain

Content types: TEXT, FILE, HTML

Properties:

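| Attribute | Value | Notes |
| --- | --- | --- |
| Type | TextField | |
| Tokenizer | StandardTokenizer | |
| Lowercasing | YES | |
| Stopwords | NO | |
| ASCII-Folding | NO | |
| Normalization | NO | |
| Word Delimiter | NO | |
| Phonetic | NO | |
| Synonyms | YES | Solr default → Internally empty |
| Leading Wildcards Support | NO | |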

Solr Schema

<fieldType name="text_plain" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Tokenizer

Tokenizers split an input text into individual tokens. They operate at the character level and produce a token stream, which the filters of the analyzer then process further. Unlike an analyzer, a tokenizer is not aware of the field it is used for and only sees the raw character stream.

The following tokenizers are used in various ADITO field types:

WhitespaceTokenizer

Splits text exclusively at whitespace. Punctuation and special characters are retained.

Input:
"To be, or what?"
Token output:
"To", "be,", "or", "what?"

StandardTokenizer

Splits text at whitespace and most punctuation and special characters. Some characters, such as dots within domains or numeric formats, are not split. The @ character is a separator, so email addresses are fragmented.

Input:
"Please, email john.doe@foo.com by 03-09, re: m37-xq."
Token output:
"Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"

KeywordTokenizer

Reads the entire input text as a single token. Used when no splitting should occur – e.g., for phone numbers, IDs, or strings to be stored exactly as entered.

Input:
"Please, email john.doe@foo.com by 03-09, re: m37-xq."
Token output:
"Please, email john.doe@foo.com by 03-09, re: m37-xq."

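The difference between WhitespaceTokenizer and KeywordTokenizer can be illustrated with a short sketch. The functions below only mimic the behavior described above and are not the Lucene implementations; StandardTokenizer follows more complex Unicode segmentation rules and is deliberately left out.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split at whitespace only; punctuation stays attached to the tokens."""
    return text.split()


def keyword_tokenize(text: str) -> list[str]:
    """Return the whole input as one single token."""
    return [text]


print(whitespace_tokenize("To be, or what?"))
print(keyword_tokenize("+49 (8743) 9664-0"))
```

With a KeywordTokenizer, any splitting has to be done by later filters, which is why the COMMUNICATION and TELEPHONE types combine it with a WordDelimiter filter.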
Filter

Filters process token streams after the tokenizer. They transform, discard, or expand the tokens depending on their function. The filter chain is crucial for the behavior of the field type.

LowerCaseFilter

Converts all letters in a token to lowercase. Other characters remain unchanged.

Example:
Input: "ADITO"
Output: "adito"


ASCIIFoldingFilter

Converts non-ASCII characters to their ASCII equivalents where such an equivalent exists, e.g., letters with diacritics (umlauts, accents).

Example:
Input: "français, südlich"
Output: "francais", "sudlich"
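The effect on diacritics can be approximated with Unicode decomposition. This sketch is not the Lucene filter: it only strips combining marks and therefore leaves characters without a decomposition (such as ß) untouched, whereas the real filter uses its own mapping table. The function name is hypothetical.

```python
import unicodedata


def fold_diacritics(token: str) -> str:
    """Strip combining marks after Unicode decomposition (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_diacritics("français"))
print(fold_diacritics("südlich"))
```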


GermanNormalizationFilter

Normalizes German umlauts, ß, and similar spelling variants, following the heuristics of the German2 Snowball algorithm.

Transformations:

  • ä, ae → a
  • ö, oe → o
  • ü, ue → u (ue only when it does not follow a vowel or q)
  • ß → ss
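These rules can be expressed as a small state machine. The sketch below is modeled on the documented heuristics of the filter and works on lowercase input only; it is meant as an illustration and may differ from the actual filter in edge cases. The function name german_normalize is hypothetical.

```python
def german_normalize(token: str) -> str:
    """Normalize lowercase German spelling variants (ä/ae -> a, ß -> ss, ...)."""
    NEUTRAL, VOWEL, UMLAUT = 0, 1, 2   # UMLAUT: a following "e" is dropped
    out = []
    state = NEUTRAL
    for ch in token:
        if ch in ("a", "o"):
            out.append(ch)
            state = UMLAUT
        elif ch == "u":
            out.append(ch)
            # "ue" is only folded when "u" does not follow a vowel or "q".
            state = UMLAUT if state == NEUTRAL else VOWEL
        elif ch == "e":
            if state != UMLAUT:
                out.append(ch)
            state = VOWEL
        elif ch in ("i", "q", "y"):
            out.append(ch)
            state = VOWEL
        elif ch in ("ä", "ö", "ü"):
            out.append({"ä": "a", "ö": "o", "ü": "u"}[ch])
            state = VOWEL
        elif ch == "ß":
            out.append("ss")
            state = NEUTRAL
        else:
            out.append(ch)
            state = NEUTRAL
    return "".join(out)


for word in ("müller", "mueller", "straße", "quelle", "bauer"):
    print(word, "->", german_normalize(word))
```

The last two words show why the "vowel or q" restriction exists: "quelle" and "bauer" keep their "ue" because it is not a transcribed umlaut.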

WordDelimiterGraphFilter

Splits tokens at word and character boundaries. Typical splits occur at CamelCase, numeric transitions, or hyphens.

Example:
Input: "hotSpot-XL42"
Output: "hot", "Spot", "XL", "42", "hotSpot", "XL42", "hotSpotXL42"

Configurable via:

  • splitOnCaseChange
  • splitOnNumerics
  • preserveOriginal
  • catenateWords / catenateNumbers / catenateAll
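The splitting rules can be approximated with a regular expression. This is a simplified sketch for ASCII input that covers case changes, letter/digit transitions, and delimiters; it does not reproduce the token graph or the individual options of the real filter, and the function names are hypothetical.

```python
import re

# An uppercase run, an optionally capitalized lowercase word, or a digit run.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_parts(token: str) -> list[str]:
    """Split a token into word and number parts."""
    return _PART.findall(token)


def catenate_all(token: str) -> str:
    """Join all parts without delimiters (comparable to catenateAll)."""
    return "".join(split_parts(token))


print(split_parts("hotSpot-XL42"))
print(catenate_all("hotSpot-XL42"))
```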

PhoneticFilter

Converts tokens into phonetic codes. Supported algorithms:

  1. DoubleMetaphone (DoubleMetaphoneFilterFactory)
    – for proper names like "Meyer" / "Meier"

  2. RefinedSoundex (PhoneticFilterFactory with encoder="RefinedSoundex")
    – simple syllable encoding

  3. BeiderMorse (BeiderMorseFilterFactory)
    – designed for personal/last names
    – higher precision than Soundex


SynonymGraphFilter

Assigns defined synonyms to existing tokens. Enables semantically equivalent search queries. ADITO currently uses empty synonym lists. The feature is prepared but not active.

Example configuration:

<filter class="solr.SynonymGraphFilterFactory"
synonyms="mysynonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.FlattenGraphFilterFactory"/>

Synonym list mysynonyms.txt:

couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny

Input: "teh small couch"
Output: "the", "tiny", "teeny", "weeny", "couch", "sofa", "divan"
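How such a synonym list is applied can be sketched as follows. The sketch assumes single-word entries and expand="true"; multi-word synonyms and token positions are not modeled, and the function names are hypothetical.

```python
def parse_synonyms(lines: list[str]) -> dict[str, list[str]]:
    """Parse single-word rules; comma lists expand to every listed term."""
    rules: dict[str, list[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: every term on the left becomes the right side.
            left, right = line.split("=>", 1)
            sources = [t.strip() for t in left.split(",")]
            targets = [t.strip() for t in right.split(",")]
        else:
            # Equivalent terms: each term maps to the whole group.
            sources = targets = [t.strip() for t in line.split(",")]
        for source in sources:
            rules[source.lower()] = targets
    return rules


def apply_synonyms(tokens: list[str], rules: dict[str, list[str]]) -> list[str]:
    """Replace each token by its synonyms; unknown tokens pass through."""
    out: list[str] = []
    for token in tokens:
        out.extend(rules.get(token.lower(), [token]))
    return out


rules = parse_synonyms([
    "couch,sofa,divan",
    "teh => the",
    "huge,ginormous,humungous => large",
    "small => tiny,teeny,weeny",
])
print(apply_synonyms(["teh", "small", "couch"], rules))
```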


ReversedWildcardFilter

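Enables efficient search queries with leading wildcards (*foo). In addition to the original token, each token is indexed in reversed form, so a leading-wildcard query can be rewritten into a trailing-wildcard query on the reversed tokens.

Input: "*bar"
Output: "rab*"
Queries without leading wildcards remain unchanged.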
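The idea behind reversed indexing can be sketched as follows. The marker character that distinguishes reversed tokens from regular ones is an assumption of this sketch, as are all function names; the thresholds configured via maxPosAsterisk and similar attributes are not modeled.

```python
from fnmatch import fnmatchcase

MARKER = "\u0001"  # assumed marker that tags reversed tokens in the index


def index_tokens(token: str) -> list[str]:
    """Index the original token plus a marked, reversed copy."""
    return [token, MARKER + token[::-1]]


def rewrite_leading_wildcard(pattern: str) -> str:
    """Turn '*bar' into a trailing-wildcard pattern on reversed tokens."""
    if not pattern.startswith("*"):
        return pattern  # other queries are left untouched
    return MARKER + pattern[::-1]


query = rewrite_leading_wildcard("*bar")
print(any(fnmatchcase(t, query) for t in index_tokens("foobar")))
print(any(fnmatchcase(t, query) for t in index_tokens("barfoo")))
```

A trailing-wildcard query only has to scan terms that share a prefix, which is much cheaper than comparing the pattern against every term in the index.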


StopFilter

Filters defined stopwords out of the token stream. ADITO uses a combined German-English stopword list (lang/adito/stopwords_mixed.txt).

Example:
Input: "To be or what?"
Tokens before filter: "To", "be", "or", "what"
Tokens after filter: "what"
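A minimal sketch of how a stopword list in snowball format can be read and applied. It assumes that everything after a vertical bar is a comment and that matching ignores case (as with ignoreCase="true"); the sample list is a shortened, hypothetical excerpt and the function names are not part of Solr.

```python
def load_snowball_stopwords(text: str) -> set[str]:
    """Collect stopwords; everything after '|' on a line is a comment."""
    words: set[str] = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]
        words.update(word.lower() for word in content.split())
    return words


def remove_stopwords(tokens: list[str], stopwords: set[str]) -> list[str]:
    """Drop every token that is a stopword, ignoring case."""
    return [t for t in tokens if t.lower() not in stopwords]


SAMPLE = """
 | shortened sample list
to
be
or   | comment after the word
"""

stopwords = load_snowball_stopwords(SAMPLE)
print(remove_stopwords(["To", "be", "or", "what"], stopwords))
```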

Stopwords

A list of German and English stopwords is used.

lang/adito/stopwords_mixed.txt

| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"

| Comments begin with vertical bar. Each stop word is at the start of a line.

| German stop word list.

aber | but

alle | all
allem
allen
aller
alles

als | than, as
also | so
am | an + dem
an | at

ander | other
andere
anderem
anderen
anderer
anderes
anderm
andern
anderr
anders

auch | also
auf | on
aus | out of
bei | by
bin | am
bis | until
bist | art
da | there
damit | with it
dann | then

der | the
den
des
dem
die
das

daß | that

derselbe | the same
derselben
denselben
desselben
demselben
dieselbe
dieselben
dasselbe

dazu | to that

dein | thy
deine
deinem
deinen
deiner
deines

denn | because

derer | of those
dessen | of him

dich | thee
dir | to thee
du | thou

dies | this
diese
diesem
diesen
dieser
dieses

doch | (several meanings)
dort | (over) there

durch | through

ein | a
eine
einem
einen
einer
eines

einig | some
einige
einigem
einigen
einiger
einiges

einmal | once

er | he
ihn | him
ihm | to him

es | it
etwas | something

euer | your
eure
eurem
euren
eurer
eures

für | for
gegen | towards
gewesen | p.p. of sein
hab | have
habe | have
haben | have
hat | has
hatte | had
hatten | had
hier | here
hin | there
hinter | behind

ich | I
mich | me
mir | to me

ihr | you, to her
ihre
ihrem
ihren
ihrer
ihres
euch | to you

im | in + dem
in | in
indem | while
ins | in + das
ist | is

jede | each, every
jedem
jeden
jeder
jedes

jene | that
jenem
jenen
jener
jenes

jetzt | now
kann | can

kein | no
keine
keinem
keinen
keiner
keines

können | can
könnte | could
machen | do
man | one

manche | some, many a
manchem
manchen
mancher
manches

mein | my
meine
meinem
meinen
meiner
meines

mit | with
muss | must
musste | had to
nach | to(wards)
nicht | not
nichts | nothing
noch | still, yet
nun | now
nur | only
ob | whether
oder | or
ohne | without
sehr | very

sein | his
seine
seinem
seinen
seiner
seines

selbst | self
sich | herself

sie | they, she
ihnen | to them

sind | are
so | so

solche | such
solchem
solchen
solcher
solches

soll | shall
sollte | should
sondern | but
sonst | else
über | over
um | about, around
und | and

uns | us
unse
unsem
unsen
unser
unses

unter | under
viel | much
vom | von + dem
von | from
vor | before
während | while
war | was
waren | were
warst | wast
was | what
weg | away, off
weil | because
weiter | further

welche | which
welchem
welchen
welcher
welches

wenn | when
werde | will
werden | will
wie | how
wieder | again
will | want
wir | we
wird | will
wirst | willst
wo | where
wollen | want
wollte | wanted
würde | would
würden | would
zu | to
zum | zu + dem
zur | zu + der
zwar | indeed
zwischen | between

| English stop word list

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with