## this is a list of all solr keys
## solr can be used as an alternative index target; solr is NOT the primary indexing system of YaCy
## this complete list of keys can be reduced:
## reduced lists of keys can be placed in DATA/SETTINGS/solr.keys.<profile>.list
## where they can be used as profiles for solr index transport

## the syntax of this file:
## - all lines beginning with '##' are comments
## - all non-empty lines not beginning with '#' are keyword lines
## - all lines beginning with '#' and where the second character is not '#' are commented-out keyword lines
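## example of the three line types (illustration only; 'example_key_s' is a hypothetical key name, not a YaCy field):
##   '## a description'  -> comment line
##   'example_key_s'     -> keyword line, the key is active
##   '#example_key_s'    -> commented-out keyword line, the key is disabled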

### mandatory values, do not disable them, YaCy won't work without them

## primary key of document, the URL hash, string (mandatory field)
id

## url of document, string (mandatory field)
sku

## last-modified from http header, date (mandatory field)
last_modified

## mime-type of document, string (mandatory field)
content_type

## content of title tag, text (mandatory field)
title

## id of the host, a 6-byte hash that is part of the document id (mandatory field)
host_id_s

## the md5 of the raw source (mandatory field)
md5_s

## the size of the raw source (mandatory field)
size_i

## index creation comment (mandatory field)
process_s

## fail reason if a page was not loaded. if the page was loaded then this field is empty, text (mandatory field)
failreason_t

## http status return code (e.g. "200" for ok), -1 if not loaded (see content of failreason_t for this case), int (mandatory field)
httpstatus_i

## redirect url if the status code is in the range 299 < httpstatus_i < 310
#httpstatus_redirect_s


### optional but highly recommended values, part of the index distribution process

## time when resource was loaded
load_date_dt

## date until resource shall be considered as fresh
fresh_date_dt

## ids of referrer to this document
referrer_id_txt

## the name of the publisher of the document
publisher_t

## the language used in the document
language_s

## number of links to audio resources
audiolinkscount_i

## number of links to video resources
videolinkscount_i

## number of links to application resources
applinkscount_i


### optional but highly recommended values, not part of the index distribution process

## tags that are attached to crawls/index generation to separate the search result into user-defined subsets
collection_sxt

## point in degrees of latitude,longitude as declared in WGS84, location
coordinate_p

## content of author-tag, text
author

## content of description-tag, text
description

## content of keywords tag; words are separated by space
keywords

## character encoding, string
charset_s

## number of words in visible area, int
wordcount_i

## total number of inbound links, int
inboundlinkscount_i

## number of inbound links with nofollow tag, int
inboundlinksnofollowcount_i

## total number of outbound (external) links, int
outboundlinkscount_i

## number of external links with nofollow tag, int
outboundlinksnofollowcount_i

## number of images, int
imagescount_i

## response time of target server in milliseconds, int
responsetime_i

## all visible text, text
text_t

## additional synonyms to the words in the text
synonyms_sxt

## h1 header
h1_txt

## h2 header
h2_txt

## h3 header
h3_txt

## h4 header
h4_txt

## h5 header
h5_txt

## h6 header
h6_txt


### optional values, not part of standard YaCy handling (but useful for external applications)

## ip of host of url (after DNS lookup), string
#ip_s

## tags of css entries, normalized with absolute URL
#css_tag_txt

## urls of css entries, normalized with absolute URL
#css_url_txt

## number of css entries, int
#csscount_i

## urls of script entries, normalized with absolute URL
#scripts_txt

## number of script entries, int
#scriptscount_i

## encoded as binary value into an integer:
## bit 0: "all" contained in html header meta
## bit 1: "index" contained in html header meta
## bit 2: "noindex" contained in html header meta
## bit 3: "nofollow" contained in html header meta
## bit 8: "noarchive" contained in http header properties
## bit 9: "nosnippet" contained in http header properties
## bit 10: "noindex" contained in http header properties
## bit 11: "nofollow" contained in http header properties
## bit 12: "unavailable_after" contained in http header properties
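## example (illustration only, computed from the bit list above): the value 12 = 2^2 + 2^3 has bits 2 and 3 set,
## meaning "noindex" and "nofollow" are contained in html header meta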
## content of <meta name="robots" content=#content#> tag and the "X-Robots-Tag" HTTP property
#robots_i

## content of <meta name="generator" content=#content#> tag, text
#metagenerator_t

## internal links, normalized (absolute URLs), as <a> - tag with anchor text and nofollow
#inboundlinks_tag_txt

## internal links, only the protocol
inboundlinks_protocol_sxt

## internal links, the url only without the protocol
inboundlinks_urlstub_txt

## internal links, the name property of the a-tag
#inboundlinks_name_txt

## internal links, the rel property of the a-tag
#inboundlinks_rel_sxt

## internal links, the rel property of the a-tag, coded binary
#inboundlinks_relflags_val

## internal links, the text content of the a-tag
#inboundlinks_text_txt

## internal links, the length of the a-tag as number of characters
#inboundlinks_text_chars_val

## internal links, the length of the a-tag as number of words
#inboundlinks_text_words_val

## if the link is an image link, this contains the alt tag if the image is also linked as img link
#inboundlinks_alttag_txt

## external links, normalized (absolute URLs), as <a> - tag with anchor text and nofollow
#outboundlinks_tag_txt

## external links, only the protocol
outboundlinks_protocol_sxt

## external links, the url only without the protocol
outboundlinks_urlstub_txt

## external links, the name property of the a-tag
#outboundlinks_name_txt

## external links, the rel property of the a-tag
#outboundlinks_rel_sxt

## external links, the rel property of the a-tag, coded binary
#outboundlinks_relflags_val

## external links, the text content of the a-tag
#outboundlinks_text_txt

## external links, the length of the a-tag as number of characters
#outboundlinks_text_chars_val

## external links, the length of the a-tag as number of words
#outboundlinks_text_words_val

## if the link is an image link, this contains the alt tag if the image is also linked as img link
#outboundlinks_alttag_txt

## all image tags, encoded as <img> tag including the alt and title properties
#images_tag_txt

## all image links without the protocol and '://'
#images_urlstub_txt

## all image link protocols
#images_protocol_sxt

## all image link alt tags
#images_alt_txt

## number of image links with alt tag
#images_withalt_i

## binary pattern for the existence of h1..h6 headlines, int
#htags_i

## url inside the canonical link element, string
#canonical_t

## link from the url property inside the refresh link element, string
#refresh_s

## all texts in <li> tags
#li_txt

## number of <li> tags, int
#licount_i

## all texts inside of <b> or <strong> tags. no duplicates. listed in decreasing order of number of occurrences
bold_txt

## number of occurrences of texts in bold_txt
#bold_val

## total number of occurrences of <b> or <strong>, int
#boldcount_i

## all texts inside of <i> tags. no duplicates. listed in decreasing order of number of occurrences
italic_txt

## number of occurrences of texts in italic_txt
#italic_val

## total number of occurrences of <i>, int
#italiccount_i

## all texts inside of <u> tags. no duplicates. listed in decreasing order of number of occurrences
underline_txt

## number of occurrences of texts in underline_txt
#underline_val

## total number of occurrences of <u>, int
#underlinecount_i

## flag that shows if a swf file is linked, boolean
#flash_b

## list of all links to frames
#frames_txt

## number of entries in frames_txt, int
#framesscount_i

## list of all links to iframes
#iframes_txt

## number of entries in iframes_txt, int
#iframesscount_i

## the protocol of the url
url_protocol_s

## all path elements in the url
url_paths_sxt

## the file name extension
url_file_ext_s

## number of key-value pairs in search part of the url
#url_parameter_i

## the keys from key-value pairs in the search part of the url
#url_parameter_key_sxt

## the values from key-value pairs in the search part of the url
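## example (illustration only, assuming a hypothetical url whose search part is '?a=1&b=2' and following the descriptions above):
## url_parameter_i would be 2, url_parameter_key_sxt would hold 'a' and 'b', url_parameter_value_sxt would hold '1' and '2'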
#url_parameter_value_sxt

## number of all characters in the url == length of sku field
#url_chars_i

## host of the url, string
host_s

## the Domain Class Name, either the TLD or a combination of ccSLD+TLD if a ccSLD is used.
#host_dnc_s

## either the second level domain or, if a ccSLD is used, the third level domain
host_organization_s

## the organization and dnc concatenated with '.'
#host_organizationdnc_s

## the remaining part of the host without organizationdnc
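## example (illustration only, assuming the hypothetical host 'www.example.co.uk' and following the descriptions above):
## host_dnc_s = co.uk, host_organization_s = example, host_organizationdnc_s = example.co.uk, host_subdomain_s = www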
#host_subdomain_s

## number of titles (counting the 'title' field) in the document
#title_count_i

## number of characters for each title
#title_chars_val

## number of words in each title
#title_words_val

## number of descriptions in the document. It does not count the 'description' field since there is only one, but it counts the number of descriptions that appear in the document (if any)
#description_count_i

## number of characters for each description
#description_chars_val

## number of words in each description
#description_words_val

## number of h1..h6 header lines
#h1_i
#h2_i
#h3_i
#h4_i
#h5_i
#h6_i

## breadcrumbs, see http://schema.org/WebPage; this counts how many itemprop="breadcrumb" properties in div tags appear within a page
#schema_org_breadcrumb_i

## Open Graph Metadata fields, see http://ogp.me/ns#
#opengraph_title_t
#opengraph_type_s
#opengraph_url_s
#opengraph_image_s

## names of cms attributes; if several are recognized then they are listed in decreasing order of number of matching criteria
#ext_cms_txt

## number of attributes that count for a specific cms in ext_cms_txt
#ext_cms_val

## names of ad-servers/ad-services
#ext_ads_txt

## number of attribute counts in ext_ads_txt
#ext_ads_val

## names of recognized community functions
#ext_community_txt

## number of attribute counts in ext_community_txt
#ext_community_val

## names of map services
#ext_maps_txt

## number of attribute counts in ext_maps_txt
#ext_maps_val

## names of tracker servers
#ext_tracker_txt

## number of attribute counts in ext_tracker_txt
#ext_tracker_val

## names matching title expressions
#ext_title_txt

## number of matching title expressions
#ext_title_val