Appendix A - Index RecordContainer Properties
Index field types
The exact Solr configuration of the individual index field types can change and differ slightly between ADITO versions.
If you are not sure which field type suits an EntityField, the Solr Admin UI lets you test how values are analyzed by the individual field types. The names of the field types in the schema mostly match the index field type name in lowercase; notable exceptions are TEXT and TEXT_NO_STOPWORDS, whose schema names start with 'adito_'.
Path example: http://localhost:8983/solr/#/test_solr9/analysis
Path pattern: http://<solr-host-and-port>/solr/#/<collection>/analysis
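The path pattern can be filled in programmatically. The following is a minimal sketch in Python; the function name `analysis_url` and the dictionary `SOLR_FIELD_TYPES` are hypothetical helpers introduced for illustration, and the mapping simply repeats the "Solr field type" entries listed for each index field type in this appendix.

```python
from urllib.parse import quote

# Index field type -> Solr schema field type, as listed in this appendix.
SOLR_FIELD_TYPES = {
    "ADDRESS": "address",
    "BOOLEAN": "boolean",
    "COMMUNICATION": "communication_address",
    "DATE": "pdate",
    "DOUBLE": "pdouble",
    "EMAIL": "email_address",
    "HTML": "adito_text_nostopwords",
    "INTEGER": "pint",
    "LOCATION": "location",
    "LONG": "plong",
    "LONG_TEXT": "adito_text_large",
    "PHONETIC_NAME": "phonetic_name",
    "PROPER_NAME": "proper_name",
    "STRING": "string",
    "TELEPHONE": "phone_number",
    "TEXT": "adito_text",
    "TEXT_NO_STOPWORDS": "adito_text_nostopwords",
    "TEXT_PLAIN": "text_plain",
}


def analysis_url(host_and_port: str, collection: str) -> str:
    """Build the Admin UI analysis URL from the path pattern."""
    return f"http://{host_and_port}/solr/#/{quote(collection, safe='')}/analysis"


print(analysis_url("localhost:8983", "test_solr9"))
print(SOLR_FIELD_TYPES["TEXT_NO_STOPWORDS"])
```

The schema name from the mapping is the value to select in the field type dropdown of the analysis page.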
The following section describes all currently available index field types, as they can be set in the indexFieldType property of the RecordFieldMappings of an IndexRecordContainer.
ADDRESS
Description:
Field type for address data such as country, city, postal code, or street names. This type normalizes input (e.g., by converting umlauts and special characters) and additionally generates phonetic tokens using RefinedSoundex. Synonyms are planned but currently not active. The goal is robust match accuracy for different spellings and abbreviations, such as country codes.
Example:
Input: Konrad-Zuse-Straße 4 DE - 84144 Geisenhausen
Tokens before phonetics: "KonradZuseStrasse", "Konrad-Zuse-Strasse", "Konrad", "Zuse", "Strasse", "4", "DE", "-", "84144", "Geisenhausen"
Tokens with phonetics: "k3089065030369030", "konradzusestrasse", "k308906", "konrad", "z5030", "zuse", "s369030", "strasse", "4", "d60", "de", "84144", "g403080308", "geisenhausen"
Solr field type: address
Content types: TEXT
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | WhitespaceTokenizer | |
Lowercasing | yes | |
Stopwords | no | |
ASCII-Folding | yes | replaces e.g. umlauts |
Normalization | yes | GermanNormalization |
Word Delimiter | yes | fully active |
Phonetic Analysis | yes | RefinedSoundex |
Synonyms | planned | not yet active |
Leading Wildcards Support | no |
Solr configuration:
<!-- fieldType for Addresses -->
<fieldType name="address" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.SynonymGraphFilterFactory"
synonyms="lang/adito/address_synonyms.txt"
ignoreCase="true" expand="false"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.PhoneticFilterFactory" encoder="RefinedSoundex"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.FlattenGraphFilterFactory"/>
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory"
generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="1"
splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.SynonymGraphFilterFactory"
synonyms="lang/adito/address_synonyms.txt"
ignoreCase="true" expand="false"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.PhoneticFilterFactory" encoder="RefinedSoundex"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
BOOLEAN
Description:
Primitive field type for boolean values. Accepts true or false. Values starting with 1, t, or T are interpreted as true, all others as false. No analysis or tokenization.
Example:
Input: true
Stored value: true
Input: 0
Stored value: false
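The interpretation rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (a value starting with 1, t, or T counts as true), not Solr's own code; the function name `parse_solr_boolean` is hypothetical.

```python
def parse_solr_boolean(value: str) -> bool:
    """Mimic the rule: values starting with 1, t, or T are true, all others false."""
    # Slicing keeps the check safe for empty strings.
    return value[:1] in ("1", "t", "T")


for raw in ("true", "0", "T", "yes", ""):
    print(repr(raw), "->", parse_solr_boolean(raw))
```

Note that under this rule values such as "yes" or "on" end up as false, so boolean values should be converted to true or false before indexing.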
Solr field type: boolean, booleans
Content types: BOOLEAN
Properties:
Attribute | Value | Notes |
---|---|---|
Type | BoolField | Primitive field |
Solr configuration:
<!-- boolean type: "true" or "false" -->
<fieldType name="boolean" className="solr.BoolField" sortMissingLast="true"/>
<fieldType name="booleans" className="solr.BoolField" sortMissingLast="true" multiValued="true"/>
COMMUNICATION
Description:
General field type for communication data such as phone numbers, email addresses, and URLs. The input is treated as a single token and then split into components (e.g., domain, TLD, local parts) via the WordDelimiter filter. Includes ASCII folding and lowercasing. For pure phone numbers, the TELEPHONE field type is recommended.
Example:
Email input: info@adito-software.de
Tokens: "info@adito-software.de", "info", "adito", "software", "de", "infoaditosoftwarede"
Phone number input: +49 (8743) 9664-0
Tokens: "+49 (8743) 9664-0", "49", "8743", "9664", "0", "49874396640"
URL input: https://www.adito.de/unternehmen/philosophie.html
Tokens: "https://www.adito.de/unternehmen/philosophie.html", "https", "www", "adito", "de", "unternehmen", "philosophie", "html", "httpswwwaditodeunternehmenphilosophiehtml"
Solr field type: communication_address
Content types: TEXT, EMAIL, TELEPHONE, LINK
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | KeywordTokenizer | |
Lowercasing | yes | |
Stopwords | no | |
ASCII-Folding | yes | |
Normalization | no | |
Word Delimiter | yes | full functionality |
Phonetic Analysis | no | |
Synonyms | no | |
Leading Wildcards Support | yes |
Solr configuration:
<!-- general fieldType for email, urls and phone-numbers -->
<fieldType name="communication_address" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.KeywordTokenizerFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
<filter className="solr.FlattenGraphFilterFactory"/>
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.KeywordTokenizerFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
DATE
Description:
Primitive field type for date values. Supports timestamps with millisecond precision in ISO format (yyyy-MM-dd'T'HH:mm:ss[.SSS]Z). Values are stored as a point field for fast range queries.
See also: Solr Working with Dates
Example:
Input: 2022-03-15T14:22:00Z
Stored value: 2022-03-15T14:22:00Z
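A value in the format described above can be produced from a timezone-aware timestamp. The following Python sketch is one possible way to do this; the helper name `to_solr_date` is hypothetical, and the conversion to UTC reflects the trailing Z of the format.

```python
from datetime import datetime, timezone


def to_solr_date(dt: datetime) -> str:
    """Format a timezone-aware datetime as yyyy-MM-dd'T'HH:mm:ss[.SSS]Z in UTC."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    utc = dt.astimezone(timezone.utc)
    text = utc.strftime("%Y-%m-%dT%H:%M:%S")
    if utc.microsecond:
        # Truncate microseconds to millisecond precision.
        text += f".{utc.microsecond // 1000:03d}"
    return text + "Z"


print(to_solr_date(datetime(2022, 3, 15, 14, 22, 0, tzinfo=timezone.utc)))
```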
Solr field type: pdate, pdates
Content types: DATE
Properties:
Attribute | Value | Notes |
---|---|---|
Type | DatePointField | Primitive field |
Solr configuration:
<!-- KD-tree versions of date fields -->
<fieldType name="pdate" className="solr.DatePointField" docValues="true"/>
<fieldType name="pdates" className="solr.DatePointField" docValues="true" multiValued="true"/>
DOUBLE
Description:
Primitive field type for 64-bit floating point numbers (double). Stored as a point field for efficient range and value queries.
Example:
Input: 3.14159
Stored value: 3.14159
Solr field type: pdouble, pdoubles
Content types: NUMBER
Properties:
Attribute | Value | Notes |
---|---|---|
Type | DoublePointField | Primitive field |
Solr configuration:
<fieldType name="pdouble" className="solr.DoublePointField" docValues="true"/>
<fieldType name="pdoubles" className="solr.DoublePointField" docValues="true" multiValued="true"/>
EMAIL
Description:
Field type specifically for email addresses. Non-ASCII characters are normalized, and special characters generate additional tokens. The WordDelimiter filter splits on special characters but not on case changes (CamelCase).
Example:
Input: info@adito-online.de
Tokens: "info@adito-online.de", "info", "adito", "online", "de", "infoadito", "aditoonline", "onlinede", "infoaditoonlinede"
Solr field type: email_address
Content types: TEXT, EMAIL
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | WhitespaceTokenizer | |
Lowercasing | yes | |
Stopwords | no | |
ASCII-Folding | yes | |
Normalization | no | |
Word Delimiter | yes | except CamelCase |
Phonetic Analysis | no | |
Synonyms | no | |
Leading Wildcards Support | yes |
Solr configuration:
<fieldType name="email_address" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
<filter className="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
<filter className="solr.FlattenGraphFilterFactory" />
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="1"/>
</analyzer>
</fieldType>
HTML
Description:
Field type for HTML content. Internally treated like TEXT_NO_STOPWORDS, i.e., no stopword filtering and no special HTML analysis.
Example:
Input: <p>Willkommen bei ADITO!</p>
Tokens: "willkommen", "bei", "adito"
Solr field type: adito_text_nostopwords
Content types: TEXT, HTML
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | StandardTokenizer | |
Lowercasing | yes | |
Stopwords | no | |
ASCII-Folding | yes | |
Normalization | yes | German |
Word Delimiter | no | |
Phonetic Analysis | no | |
Synonyms | yes | Solr default |
Leading Wildcards Support | no |
Solr configuration:
see section TEXT_NO_STOPWORDS
INTEGER
Description:
Primitive field type for 32-bit signed integers (int). Stored as a point field for efficient range queries.
Example:
Input: 42
Stored value: 42
Solr field type: pint, pints
Content types: NUMBER
Properties:
Attribute | Value | Notes |
---|---|---|
Type | IntPointField | Primitive field |
Solr configuration:
<fieldType name="pint" className="solr.IntPointField" docValues="true"/>
<fieldType name="pints" className="solr.IntPointField" docValues="true" multiValued="true"/>
LOCATION
Description:
Primitive field type for geographic coordinates (latitude/longitude pairs, format: lat,lon). Supports spatial search and distance calculation. Stored as a point field.
See also: Solr Spatial Search
Example:
Input: 48.123456,11.654321
Stored value: 48.123456,11.654321
Solr field type: location
Content types: TEXT
Properties:
Attribute | Value | Notes |
---|---|---|
Type | LatLonPointSpatialField | Primitive field |
Solr configuration:
<!-- A specialized field for geospatial search filters and distance sorting. -->
<fieldType name="location" className="solr.LatLonPointSpatialField" docValues="true"/>
LONG
Description:
Primitive field type for 64-bit signed integers (long). Stored as a point field for efficient range queries.
Example:
Input: 12345678901234
Stored value: 12345678901234
Solr field type: plong, plongs
Content types: NUMBER, FILESIZE, DATE
Properties:
Attribute | Value | Notes |
---|---|---|
Type | LongPointField | Primitive field |
Solr configuration:
<fieldType name="plong" className="solr.LongPointField" docValues="true"/>
<fieldType name="plongs" className="solr.LongPointField" docValues="true" multiValued="true"/>
LONG_TEXT
Description:
Field type for large text content such as PDFs with many pages or entire books.
This field type behaves like the TEXT type with three exceptions:
- Stopwords are already filtered during indexing.
- Words split by a hyphen at a line break are joined together again.
- The large property prevents the contents from being loaded into the (Solr) cache.
IN: Eine neue Infografik über unser Logo soll …
OUT: "neue"(2), "infografik"(3), "unser"(5), "logo"(6), "soll"(7),
Solr field type: adito_text_large
Content types: TEXT, FILE, HTML
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | WhitespaceTokenizer | |
Lowercasing | YES | |
Stopwords | YES | |
ASCII-Folding | YES | |
Normalization | YES | |
Word Delimiter | YES | Only special characters |
Phonetic | NO | |
Synonyms | YES | Solr default → Internally empty |
Leading Wildcards Support | NO |
The large attribute prevents the contents of the field from being loaded into the Solr cache. This increases search performance, as large texts, such as the entire content of a PDF, do not "clog" the cache.
However, fields with this attribute cannot be multiValued!
Solr Schema
<fieldType name="adito_text_large" className="solr.TextField" positionIncrementGap="100" multiValued="true" large="true">
<analyzer type="index">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.HyphenatedWordsFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>
<filter className="solr.StopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.FlattenGraphFilterFactory"/>
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.StopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PHONETIC_NAME
Description:
Field type for phonetic content such as personal names. This type uses a phonetic filter (BeiderMorseFilter) to analyze the content. This enables matches for terms or names that sound similar, e.g., "Meier", "Maier", and "Mayer".
IN: Tim Meier
OUT: "tim"(1), "tn"(1), "mDr"(2)
IN: Tim Maier
OUT: "tim"(1), "tn"(1), "mDr"(2)
Solr field type: phonetic_name
Content types: TEXT
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | StandardTokenizer | |
Lowercasing | YES | |
Stopwords | NO | |
ASCII-Folding | NO | |
Normalization | NO | |
Word Delimiter | YES | CamelCase |
Phonetic | YES | BeiderMorse |
Synonyms | YES | Currently empty |
Leading Wildcards Support | YES |
Solr Schema
<fieldType name="phonetic_name" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.SynonymGraphFilterFactory" synonyms="lang/adito/pers_name_synonyms.txt" ignoreCase="true"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto" />
<filter className="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
<filter className="solr.FlattenGraphFilterFactory" />
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.SynonymGraphFilterFactory" synonyms="lang/adito/pers_name_synonyms.txt" ignoreCase="true"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto" />
</analyzer>
</fieldType>
PROPER_NAME
Description:
Field type for proper names such as company names. The content is first normalized (umlauts & non-ASCII characters) and then analyzed with a simple phonetic filter (DoubleMetaphoneFilter).
IN: quick-mix
OUT: "quickmix"(1), "KKMK"(1), "quick-mix"(1), "quick"(1), "KK"(1), "mix"(1), "MKS"(1)
Solr field type: proper_name
Content types: TEXT
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | WhitespaceTokenizer | |
Lowercasing | YES | |
Stopwords | NO | |
ASCII-Folding | YES | |
Normalization | YES | |
Word Delimiter | YES | FULL |
Phonetic | YES | DoubleMetaphone |
Synonyms | NO | |
Leading Wildcards Support | YES |
Solr Schema
<fieldType name="proper_name" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.DoubleMetaphoneFilterFactory"/>
<filter className="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="2" maxPosQuestion="1" minTrailing="2" maxFractionAsterisk="0"/>
<filter className="solr.FlattenGraphFilterFactory" />
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.WhitespaceTokenizerFactory"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.DoubleMetaphoneFilterFactory"/>
</analyzer>
</fieldType>
STRING
Description:
Primitive field type for short strings (UTF-8). No analysis or tokenization – the value is stored as-is. Suitable for fields up to ~32 KB.
Example:
Input: ADITO123
Stored value: ADITO123
Solr field type: string, strings
Content types: TEXT, BOOLEAN, DATE
Properties:
Attribute | Value | Notes |
---|---|---|
Type | StrField | no analysis, stored as given |
Solr configuration:
<!-- The StrField type is not analyzed, but indexed/stored verbatim. -->
<fieldType name="string" className="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" className="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
TELEPHONE
Description:
Field type optimized for phone numbers.
This type can handle area codes and + signs. The digits of the number are concatenated (e.g.: +49 871 123456 → 0049871123456) and then additional sub-numbers (n-grams) are generated.
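The normalization described above can be approximated in Python. This is a rough sketch under stated assumptions: the three regular expressions are copied from the PatternReplaceFilterFactory entries of the phone_number field type, while joining all digit runs is only a simplified stand-in for the WordDelimiterGraphFilter with catenateAll, and `ngrams` imitates the NGramFilter with minGramSize 3 and maxGramSize 30. The function names are hypothetical.

```python
import re


def normalize_phone(raw: str) -> str:
    """Approximate the phone_number filter chain up to the catenated number."""
    value = raw.lower()
    value = re.sub(r"^[+]", "00", value, count=1)       # leading + becomes 00
    value = re.sub(r"^0([^0])", r"\1", value, count=1)  # drop a single leading 0
    value = re.sub(r"\s", "-", value)                   # whitespace becomes a delimiter
    # Simplification: concatenate all digit runs (stand-in for catenateAll).
    return "".join(re.findall(r"\d+", value))


def ngrams(token: str, min_size: int = 3, max_size: int = 30) -> list[str]:
    """Return all substrings of the token with a length in [min_size, max_size]."""
    return [
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_size, max_size + 1)
        if start + size <= len(token)
    ]


number = normalize_phone("+49 871 123456")
print(number)
print(len(ngrams(number)))
```

Because every substring of at least three digits is indexed, a search for any part of a number can match without wildcards.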
IN: +49 (8743) 9664-0
OUT: "0049874396640"(1), "004987439664"(1), "00498743966"(1), "0049874396"(1), … "049874396640"(1), "49874396640"(1), "9874396640"(1), "874396640"(1), … "004"(1), "049"(1), "498"(1) … "966"(1), "664"(1), "640"(1)
Solr field type: phone_number
Content types: TEXT, TELEPHONE
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | KeywordTokenizer | |
Lowercasing | YES | |
Stopwords | NO | |
ASCII-Folding | NO | |
Normalization | NO | |
Word Delimiter | YES | Only for numbers |
Phonetic | NO | |
Synonyms | NO | |
Leading Wildcards Support | NO |
Solr Schema
<fieldType name="phone_number" className="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer className="solr.KeywordTokenizerFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="^[+]" replacement="00" replace="first"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="^0([^0])" replacement="$1" replace="first"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="\s" replacement="-"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="0"/>
<filter className="solr.NGramFilterFactory" minGramSize="3" maxGramSize="30"/>
<filter className="solr.FlattenGraphFilterFactory" />
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.KeywordTokenizerFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="^[+]" replacement="00" replace="first"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="^0([^0])" replacement="$1" replace="first"/>
<filter className="solr.PatternReplaceFilterFactory" pattern="\s" replacement="-"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="0" generateNumberParts="1" catenateWords="0" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" preserveOriginal="0"/>
</analyzer>
</fieldType>
TEXT
Description:
Type for standard text.
The content is normalized (umlauts & non-ASCII characters). Terms with special characters and CamelCase are additionally split.
During search, stopwords are filtered; however, if the pattern only contains stopwords (e.g., 'AT', which is also a country code), the stopword filter is ignored.
Example: Indexing
IN: Eine neue Infografik über unser Logo soll …
OUT: "eine"(1), "neue"(2), "infografik"(3), "uber"(4), "unser"(5),
"logo"(6), "soll"(7),
Example: Searching
IN: Eine neue Infografik über unser Logo soll …
OUT: "neue"(2), "infografik"(3), "logo"(6), "soll"(7),
Example: Searching only stopwords
IN: Sein oder nicht sein
OUT: "sein"(1), "oder"(2), "nicht"(3), "sein"(4),
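The behavior described for search patterns that consist only of stopwords can be pictured as follows. This Python sketch illustrates the described effect only and makes no claim about how ADITO or Solr implement it; the function name `filter_query_tokens` is hypothetical, and `STOPWORDS` is a small excerpt of the stopword list for demonstration.

```python
# Small excerpt of the mixed stopword list, for illustration only.
STOPWORDS = {"eine", "über", "unser", "sein", "oder", "nicht"}


def filter_query_tokens(tokens: list[str]) -> list[str]:
    """Drop stopwords, but keep the query unchanged if nothing else remains."""
    kept = [token for token in tokens if token.lower() not in STOPWORDS]
    return kept if kept else list(tokens)


print(filter_query_tokens(["Eine", "neue", "Infografik"]))
print(filter_query_tokens(["Sein", "oder", "nicht", "sein"]))
```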
Solr field type: adito_text
Content types: TEXT, FILE, HTML
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | StandardTokenizer | |
Lowercasing | YES | |
Stopwords | YES | |
ASCII-Folding | YES | |
Normalization | YES | German |
Word Delimiter | YES | only word splitting |
Phonetic | NO | |
Synonyms | YES | Solr default → Internally empty |
Leading Wildcards Support | NO |
Solr Schema
<!-- Default ADITO text field used by dynamic schema -->
<fieldType name="adito_text" className="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.WordDelimiterGraphFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
<filter className="solr.LowerCaseFilterFactory"/>
<filter className="solr.FlattenGraphFilterFactory"/>
<filter className="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.SuggestStopFilterFactory" ignoreCase="true" words="lang/adito/stopwords_mixed.txt" format="snowball"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
TEXT_NO_STOPWORDS
Description:
Standard text field (TEXT) without stopword filtering.
Example
IN: Eine neue Infografik über unser Logo soll …
OUT: "eine"(1), "neue"(2), "infografik"(3), "uber"(4), "unser"(5),
"logo"(6), "soll"(7),
Solr field type: adito_text_nostopwords
Content types: TEXT, FILE, HTML
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | StandardTokenizer | |
Lowercasing | YES | |
Stopwords | NO | |
ASCII-Folding | YES | |
Normalization | YES | German |
Word Delimiter | NO | |
Phonetic | NO | |
Synonyms | YES | Solr default → Internally empty |
Leading Wildcards Support | NO |
Solr Schema
<fieldType name="adito_text_nostopwords" className="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.GermanNormalizationFilterFactory"/>
<filter className="solr.ASCIIFoldingFilterFactory"/>
<filter className="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
TEXT_PLAIN
Description:
Field type for texts whose content should not be analyzed.
This type only eliminates punctuation and transforms the text to lowercase.
This field type treats 'ä', 'ö', 'ü', and 'ß' as distinct characters.
Example
IN: Neue ADITO Schreibblöcke!
OUT: "neue"(1), "adito"(2), "schreibblöcke"(3)
Solr field type: text_plain
Content types: TEXT, FILE, HTML
Properties:
Attribute | Value | Notes |
---|---|---|
Type | TextField | |
Tokenizer | StandardTokenizer | |
Lowercasing | YES | |
Stopwords | NO | |
ASCII-Folding | NO | |
Normalization | NO | |
Word Delimiter | NO | |
Phonetic | NO | |
Synonyms | YES | Solr default → Internally empty |
Leading Wildcards Support | NO |
Solr Schema
<fieldType name="text_plain" className="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer className="solr.StandardTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter className="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter className="solr.FlattenGraphFilterFactory"/>
-->
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer className="solr.StandardTokenizerFactory"/>
<filter className="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter className="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter className="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Tokenizer
Tokenizers split an input text into individual tokens. They operate at the character level and produce a TokenStream, which the filters of the analyzer then process further. Unlike an analyzer, a tokenizer does not know the field context and only processes the raw character stream.
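The two simplest tokenizers used by the ADITO field types, WhitespaceTokenizer and KeywordTokenizer, can be imitated in a few lines of Python to make the difference concrete. This is an illustrative sketch with hypothetical function names, not the Lucene implementation; the StandardTokenizer follows more elaborate word-boundary rules and is not reproduced here.

```python
def whitespace_tokenize(text: str) -> list[str]:
    """Split only at whitespace; punctuation stays attached to the tokens."""
    return text.split()


def keyword_tokenize(text: str) -> list[str]:
    """Treat the entire input as one single token."""
    return [text]


sample = "To be, or what?"
print(whitespace_tokenize(sample))
print(keyword_tokenize(sample))
```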
The following tokenizers are used in various ADITO field types:
WhitespaceTokenizer
Splits text exclusively at whitespace. Punctuation and special characters are retained.
Input: "To be, or what?"
Token output: "To", "be,", "or", "what?"
StandardTokenizer
Splits text at whitespace and most punctuation and special characters. Some characters, such as dots within domains or numeric formats, are not split. The @ character is a separator, so email addresses are fragmented.
Input: "Please, email john.doe@foo.com by 03-09, re: m37-xq."
Token output: "Please", "email", "john.doe", "foo.com", "by", "03", "09", "re", "m37", "xq"
KeywordTokenizer
Reads the entire input text as a single token. Used when no splitting should occur – e.g., for phone numbers, IDs, or strings to be stored exactly as entered.
Input:
"Please, email john.doe@foo.com by 03-09, re: m37-xq."
Token output:
"Please, email john.doe@foo.com by 03-09, re: m37-xq."
Filter
Filters process token streams after the tokenizer. They transform, discard, or expand the tokens depending on their function. The filter chain is crucial for the behavior of the field type.
LowerCaseFilter
Converts all letters in a token to lowercase. Other characters remain unchanged.
Example:
Input: "ADITO"
Output: "adito"
ASCIIFoldingFilter
Converts non-ASCII characters, such as letters with diacritics (umlauts, accents), to their ASCII equivalents where one exists.
Example:
Input: "français, südlich"
Output: "francais", "sudlich"
GermanNormalizationFilter
Normalizes German umlauts, ß, and similar spelling variants. The filter follows the heuristics of the German2 Snowball algorithm.
Transformations:
- ä, ae → a
- ö, oe → o
- ü, ue → u (ue is kept after a vowel or q)
- ß → ss
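The transformations can be reproduced approximately with a small state machine. The Python sketch below is an approximation written for illustration, not the Lucene source: it tracks whether the previous character allows a following e to be dropped, so that "ue" is kept after a vowel or q. It only handles lowercase input, and the function name `german_normalize` is hypothetical.

```python
def german_normalize(token: str) -> str:
    """Approximate German normalization: umlauts, ae/oe/ue digraphs, and ß."""
    NEUTRAL, VOWEL, UMLAUT_BASE = 0, 1, 2
    state = NEUTRAL
    out = []
    for ch in token:
        if ch in "ao":
            state = UMLAUT_BASE          # a following e would be dropped
            out.append(ch)
        elif ch == "u":
            # u starts a digraph only if it does not follow a vowel or q
            state = UMLAUT_BASE if state == NEUTRAL else VOWEL
            out.append(ch)
        elif ch == "e":
            if state != UMLAUT_BASE:     # keep e unless it completes ae/oe/ue
                out.append(ch)
            state = VOWEL
        elif ch in "iqy":
            state = VOWEL
            out.append(ch)
        elif ch in "äöü":
            out.append({"ä": "a", "ö": "o", "ü": "u"}[ch])
            state = VOWEL
        elif ch == "ß":
            out.append("ss")
            state = NEUTRAL
        else:
            state = NEUTRAL
            out.append(ch)
    return "".join(out)


for word in ("straße", "über", "mueller", "quelle"):
    print(word, "->", german_normalize(word))
```

With these rules, spelling variants such as "müller", "mueller", and "muller" collapse to the same token.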
WordDelimiterGraphFilter
Splits tokens at word and character boundaries. Typical splits occur at CamelCase, numeric transitions, or hyphens.
Example:
Input: "hotSpot-XL42"
Output: "hot", "Spot", "XL", "42", "hotSpot", "XL42", "hotSpotXL42"
Configurable via:
- splitOnCaseChange
- splitOnNumerics
- preserveOriginal
- catenateWords, catenateNumbers, catenateAll
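The splitting rules can be imitated with a regular expression. The following Python sketch is a strong simplification for illustration: it covers only ASCII letters and digits, always splits on case changes and letter/digit transitions, and supports just two hypothetical switches that loosely correspond to catenateAll and preserveOriginal. It does not reproduce the positions or the full option set of the real filter.

```python
import re

# Uppercase runs, capitalized or lowercase words, and digit runs.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_parts(token: str) -> list[str]:
    """Split at delimiters, case changes, and letter/digit transitions."""
    return _PART.findall(token)


def delimit(token: str, catenate_all: bool = True,
            preserve_original: bool = True) -> list[str]:
    """Return the parts plus optional catenated and original tokens."""
    parts = split_parts(token)
    result = list(parts)
    if catenate_all and len(parts) > 1:
        result.append("".join(parts))
    if preserve_original and token not in result:
        result.append(token)
    return result


print(split_parts("hotSpot-XL42"))
print(delimit("info@adito-software.de"))
```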
PhoneticFilter
Converts tokens into phonetic codes. Supported algorithms:
- DoubleMetaphone (DoubleMetaphoneFilter): for proper names like "Meyer" / "Meier"
- RefinedSoundex (PhoneticFilterFactory): simple syllable encoding
- BeiderMorse (BeiderMorseFilterFactory): designed for personal/last names, higher precision than Soundex
SynonymGraphFilter
Assigns defined synonyms to existing tokens. Enables semantically equivalent search queries. ADITO currently uses empty synonym lists. The feature is prepared but not active.
Example configuration:
<filter class="solr.SynonymGraphFilterFactory"
synonyms="mysynonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.FlattenGraphFilterFactory"/>
Synonym list mysynonyms.txt:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Input: "teh small couch"
Output: "the", "tiny", "teeny", "weeny", "couch", "sofa", "divan"
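How the two rule forms of the synonym list lead to the output above can be traced with a short Python sketch. It is a simplified illustration with hypothetical function names: only single-word synonyms are handled, comment lines starting with # are skipped, and token positions are ignored, whereas the real filter builds a token graph.

```python
def parse_synonyms(lines: list[str], expand: bool = True) -> dict[str, list[str]]:
    """Parse 'a,b,c' equivalence rules and 'a,b => c,d' explicit mappings."""
    mapping: dict[str, list[str]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            left, right = line.split("=>", 1)
            targets = [word.strip() for word in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            group = [word.strip() for word in line.split(",")]
            for word in group:
                # expand=True maps each word to the whole group
                mapping[word] = group if expand else [group[0]]
    return mapping


def apply_synonyms(tokens: list[str], mapping: dict[str, list[str]]) -> list[str]:
    """Replace each token by its synonyms, or keep it if no rule matches."""
    result: list[str] = []
    for token in tokens:
        result.extend(mapping.get(token, [token]))
    return result


rules = [
    "couch,sofa,divan",
    "teh => the",
    "huge,ginormous,humungous => large",
    "small => tiny,teeny,weeny",
]
print(apply_synonyms("teh small couch".split(), parse_synonyms(rules)))
```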
ReversedWildcardFilter
Enables efficient search queries with leading wildcards (*foo). Tokens are indexed in reverse.
Input: "*bar"
Output: "rab*"
Tokens without wildcards remain unchanged.
StopFilter
Filters defined stopwords out of the token stream. ADITO uses a combined German-English stopword list (lang/adito/stopwords_mixed.txt).
Example:
Input: "To be or what?"
Tokens before filter: "To", "be", "or", "what"
Tokens after filter: "what"
Stopwords
A list of German and English stopwords is used.
lang/adito/stopwords_mixed.txt
| NOTE: To use this file with StopFilterFactory, you must specify format="snowball"
| Comments begin with vertical bar. Each stop word is at the start of a line.
| German stop word list.
aber | but
alle | all
allem
allen
aller
alles
als | than, as
also | so
am | an + dem
an | at
ander | other
andere
anderem
anderen
anderer
anderes
anderm
andern
anderr
anders
auch | also
auf | on
aus | out of
bei | by
bin | am
bis | until
bist | art
da | there
damit | with it
dann | then
der | the
den
des
dem
die
das
daß | that
derselbe | the same
derselben
denselben
desselben
demselben
dieselbe
dieselben
dasselbe
dazu | to that
dein | thy
deine
deinem
deinen
deiner
deines
denn | because
derer | of those
dessen | of him
dich | thee
dir | to thee
du | thou
dies | this
diese
diesem
diesen
dieser
dieses
doch | (several meanings)
dort | (over) there
durch | through
ein | a
eine
einem
einen
einer
eines
einig | some
einige
einigem
einigen
einiger
einiges
einmal | once
er | he
ihn | him
ihm | to him
es | it
etwas | something
euer | your
eure
eurem
euren
eurer
eures
für | for
gegen | towards
gewesen | p.p. of sein
hab | have
habe | have
haben | have
hat | has
hatte | had
hatten | had
hier | here
hin | there
hinter | behind
ich | I
mich | me
mir | to me
ihr | you, to her
ihre
ihrem
ihren
ihrer
ihres
euch | to you
im | in + dem
in | in
indem | while
ins | in + das
ist | is
jede | each, every
jedem
jeden
jeder
jedes
jene | that
jenem
jenen
jener
jenes
jetzt | now
kann | can
kein | no
keine
keinem
keinen
keiner
keines
können | can
könnte | could
machen | do
man | one
manche | some, many a
manchem
manchen
mancher
manches
mein | my
meine
meinem
meinen
meiner
meines
mit | with
muss | must
musste | had to
nach | to(wards)
nicht | not
nichts | nothing
noch | still, yet
nun | now
nur | only
ob | whether
oder | or
ohne | without
sehr | very
sein | his
seine
seinem
seinen
seiner
seines
selbst | self
sich | herself
sie | they, she
ihnen | to them
sind | are
so | so
solche | such
solchem
solchen
solcher
solches
soll | shall
sollte | should
sondern | but
sonst | else
über | over
um | about, around
und | and
uns | us
unse
unsem
unsen
unser
unses
unter | under
viel | much
vom | von + dem
von | from
vor | before
während | while
war | was
waren | were
warst | wast
was | what
weg | away, off
weil | because
weiter | further
welche | which
welchem
welchen
welcher
welches
wenn | when
werde | will
werden | will
wie | how
wieder | again
will | want
wir | we
wird | will
wirst | willst
wo | where
wollen | want
wollte | wanted
würde | would
würden | would
zu | to
zum | zu + dem
zur | zu + der
zwar | indeed
zwischen | between
| English stop word list
a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with
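A file in this format can be read with very little code, which is useful for checking whether a given term is affected by the stopword filter. The Python sketch below assumes the Snowball conventions stated in the file header: everything after a vertical bar is a comment, and the remaining text of a line holds one or more whitespace-separated stopwords. The function name `parse_snowball_stopwords` is hypothetical.

```python
def parse_snowball_stopwords(text: str) -> set[str]:
    """Collect stopwords from Snowball-formatted text, ignoring '|' comments."""
    words: set[str] = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]  # drop the comment part, if any
        words.update(content.split())    # remaining words are stopwords
    return words


sample = """\
| German stop word list.
aber | but
alle | all
allem
| English stop word list
a an and
"""
print(sorted(parse_snowball_stopwords(sample)))
```

Words that appear only after the vertical bar, such as the English glosses, are comments and do not become stopwords.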