yacy_search_server/defaults/solr.webgraph.schema


Introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To make a second core possible, multi-core functionality had to be implemented in the deeply embedded solr:
- migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1
- added solr_40/webgraph subdirectory as second core
- added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html
- added instance handling as an addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding
- also migrated caching and sharding on top of the new instance handling
- migrated the search apis to handle access to a specific core; the default core is named 'collection1'
- migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr', using the peer address as solr address
- migrated the solr backup and restore process: old backups cannot be used after this migration!
- redesigned solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this also solves some bugs when switching the index/network)
- added another schema 'solr.webgraph.schema'; the old solr.keys.list is replaced by solr.collection.schema
12 years ago
## this is a list of all solr keys for the webgraph index 'webgraph', an index of all links
## this complete list of keys can be changed; the actual schema is stored in:
## DATA/SETTINGS/solr.webgraph.schema
## the syntax of this file:
## - all lines beginning with '##' are comments
## - all non-empty lines not beginning with '#' are keyword lines
## - all lines beginning with '#' and where the second character is not '#' are commented-out keyword lines
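## a short illustration of the three line types described above; the field
## names are hypothetical and every line is written as a comment so that this
## file is still parsed unchanged:
##   "## free text"       a comment, ignored
##   "example_field_s"    a keyword line, the field is active
##   "#example_field_s"   a commented-out keyword line, the field is known but disabled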
##
## document organization
##
## primary key of document, a combination of <source-url-hash><target-url-hash><four-digit-hex-counter> (28 characters)
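## illustration with made-up values, assuming each url hash is 12 characters
## long (which is what 28 characters minus the four-digit counter implies):
##   source hash "AAAAAAAAAAAA" + target hash "BBBBBBBBBBBB" + counter "0001"
##   gives the id "AAAAAAAAAAAABBBBBBBBBBBB0001"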
id
## tags that are attached to crawls/index generation to separate the search result into user-defined subsets
collection_sxt
##
## url construction information about the source
##
## primary key of document, the URL hash (source)
source_id_s
## the url of the document (source)
#source_url_s
## the file name extension (source)
#source_file_ext_s
## normalized (absolute URLs), as <a> - tag with anchor text and nofollow (source)
#source_tag_s
## number of all characters in the url (source)
#source_chars_i
## the protocol of the url (source)
#source_protocol_s
## path of the url (source)
#source_path_s
## count of all path elements in the url (source)
#source_path_folders_count_i
## all path elements in the url (source)
#source_path_folders_sxt
## number of key-value pairs in search part of the url (source)
#source_parameter_count_i
## the keys from key-value pairs in the search part of the url (source)
#source_parameter_key_sxt
## the values from key-value pairs in the search part of the url (source)
#source_parameter_value_sxt
## depth of web page according to number of clicks from the 'main' page, which is the page that appears if only the host is entered as url (source)
#source_clickdepth_i
## host of the url (source)
#source_host_s
## the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used (source)
#source_host_dnc_s
## either the second level domain or, if a ccSLD is used, the third level domain (source)
#source_host_organization_s
## the organization and dnc concatenated with '.' (source)
#source_host_organizationdnc_s
## the remaining part of the host without organizationdnc (source)
#source_host_subdomain_s
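## illustration of the host fields above, as read from their descriptions
## (the host name is only an example, not taken from an actual index):
##   host            "www.example.co.uk"
##   dnc             "co.uk"           (ccSLD+TLD, because a ccSLD is used)
##   organization    "example"         (third level domain, because a ccSLD is used)
##   organizationdnc "example.co.uk"
##   subdomain       "www"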
##
## Information in the source about the target
##
## the text content of the a-tag (in source, but pointing to a target)
target_linktext_t
## the length of the a-tag content text as number of characters (in source, but pointing to a target)
#target_linktext_charcount_i
## the length of the a-tag content text as number of words (in source, but pointing to a target)
#target_linktext_wordcount_i
## if the link is an image link, this contains the alt tag if the image is also linked as img link (in source, but pointing to a target)
target_alt_t
## the length of the alt text as number of characters (in source, but pointing to a target)
#target_alt_charcount_i
## the length of the alt text as number of words (in source, but pointing to a target)
#target_alt_wordcount_i
## the name property of the a-tag (in source, but pointing to a target)
target_name_t
## the rel property of the a-tag (in source, but pointing to a target)
#target_rel_s
## the rel property of the a-tag, coded binary (in source, but pointing to a target)
#target_relflags_i
##
## url construction information about the target
##
## primary key of document, the URL hash (target)
target_id_s
## the url of the document (target)
target_url_s
## the file name extension (target)
target_file_ext_s
## normalized (absolute URLs), as <a> - tag with anchor text and nofollow (target)
#target_tag_s
## number of all characters in the url (target)
#target_chars_i
## the protocol of the url (target)
target_protocol_s
## path of the url (target)
#target_path_s
## count of all path elements in the url (target)
#target_path_folders_count_i
## all path elements in the url (target)
target_path_folders_sxt
## number of key-value pairs in search part of the url (target)
#target_parameter_count_i
## the keys from key-value pairs in the search part of the url (target)
#target_parameter_key_sxt
## the values from key-value pairs in the search part of the url (target)
#target_parameter_value_sxt
## depth of web page according to number of clicks from the 'main' page, which is the page that appears if only the host is entered as url (target)
#target_clickdepth_i
## host of the url (target)
#target_host_s
## the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used (target)
#target_host_dnc_s
## either the second level domain or, if a ccSLD is used, the third level domain (target)
#target_host_organization_s
## the organization and dnc concatenated with '.' (target)
#target_host_organizationdnc_s
## the remaining part of the host without organizationdnc (target)
#target_host_subdomain_s