yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	765943a4b7	Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages are these days optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on a per-crawl-start basis. Every crawl start can have a different user agent.	11 years ago
Michael Peter Christen	57ffdfad4c	added a crawl option to obey html-meta-robots-noindex. This is on by default.	12 years ago
Michael Peter Christen	25499eead5	- added a new field for the regular expression in crawl start - added the field in crawl profile - adapted logging and error management - adapted duplicate document detection - added a new rule to the indexing process to reject non-matching content - full redesign of the expert crawl start servlet The new filter field can now be seen in /CrawlStartExpert_p.html at Section "Document Filter", subsection item "Filter on Content of Document"	12 years ago
Michael Peter Christen	0b6566a389	optimizations when starting large crawl requests with many start urls in one request: - allow larger match-fields in html interface - delete all host hashes at once from zurl - when deleting by host, do not count size of deleted entries since that was the reason it took so long	12 years ago
Michael Peter Christen	fb0fa9a102	- fixed 'delete from subpath' during crawl start which deleted nothing; now works; - changed some crawl start html design details	12 years ago
orbiter	b55ea2197f	- redesign of crawl start servlet - for domain-limited crawls, the domain is deleted now by default before the crawl is started	12 years ago
orbiter	1c66de4bd4	- removed scheduled crawling options in crawl start because it is superfluous there; it can be changed in the scheduler servlet. It's also confusing in the presence of the delete-option, which will be implemented next. - removed unused crawl start servlet - some refactoring to make the time parser reusable	12 years ago
Michael Peter Christen	5e77801aac	update to web interface structure	12 years ago
orbiter	354ef8000d	- added 'deleteold' option to the crawler which deletes the documents that are selected by a crawl filter (host or subpath) - site crawl uses this option by default now - added a concurrency option to deleteDomain()	12 years ago
Michael Peter Christen	ac9540dfb6	removed options for stopwords which are not used	12 years ago
orbiter	60b1e23f05	added new crawl options: - indexUrlMustMatch and indexUrlMustNotMatch which can be used to select loaded pages for indexing. Default patterns are in such a way that all loaded pages are also indexed (as before) but when doing an expert crawl start, then the user may select only specific urls to be indexed. - crawlerNoDepthLimitMatch is a new pattern that can be used to remove the crawl depth limitation. This filter is a never-match by default (which causes that the depth is used) but the user can select paths which will be loaded completely even if a crawl depth is reached.	12 years ago
Michael Peter Christen	a13e5153ac	- added the possibility to have not one but a list of crawl start urls - the list of urls is entered in the expert crawl start in a textfield; the one-line input field was replaced with a text box - start urls can also be given in one single line where the urls are separated by a '|'-character - as an effect, the crawl profile cannot carry a single start url for identification because it is possible to have more. Therefore the url was removed from the crawl profile - this affects all servlets which display a crawl profile: removed the url field from all these servlets - to work consistently with several start urls and the other crawl starts which computed crawl start url lists from sitelists or sitemaps, the crawl start servlet was restructured completely - new rules for must-match patterns were created to make it possible that site crawl starts also work with several crawl starts at once	12 years ago
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	12 years ago
Michael Peter Christen	d7eb18cdf2	accept also file names beginning with "file://" for crawl start from file.	13 years ago
Michael Peter Christen	8bfc987374	enhanced hint how to enter file:// urls	13 years ago
orbiter	ebd840ebf6	- enhanced description on search front page - fixed language and heuristic modifier - added hint to crawl start that we can do also ftp and smb crawls - added a protocol extension to remote crawls to transport all search modifiers to remote peers git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	e4a82ddd8b	produce a bookmark entry from every crawl start. these bookmarks are always private. these bookmarks will be used to get a source reference for the search in case of intranet or portal searches. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8062 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	ff32469272	added a link to /api/util/getpageinfo_p.xml as API to crawl start info and to ViewFile.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8035 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
low012	1b8b989744	*) set maxlength of input field for country code filter to value > default text length (old value caused warning in Opera) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8002 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	cf4fd525ee	added directDocByURL attribute in crawl profile git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7985 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	b250e6466d	implemented crawl restrictions for IP pattern and country lists git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7980 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	5ad7f9612b	added crawl settings for three new filters for each crawl: must-match for IPs (IPs that are known after DNS resolving for each URL in the crawl queue) must-not-match for IPs must-match against a list of country codes (allows only loading from hosts that are hosted in given countries) note: the settings and input environment is there with that commit, but the values are not yet evaluated git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7976 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	af63aa1d0e	added fresh links to java regular expression api-doc git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7763 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	7962d35425	- removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons: 1) if the file is changed for a re-crawl this is not reflected in the steering because it would take the previously uploaded crawl start file 2) browsers do not submit the full path of the selected file even if this path is shown in the input field because of security reasons. There is no work-around or hack to make the submission of the full path possible - fixed deletion of crawl start point urls in crawl stack and balancer double-check - fixed a problem with steering self-call (no resolving of localhost) - added more logging for the crawler to supervise why crawl urls are not taken by the loader - added a javascript onload-function to select domain restriction in all cases where a crawl is started from a file or from a url - fixed the restrict-to-domain pattern computation, added a 'www.'-prefix and added this functionality also to a crawl start from file git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7574 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	11bebe356b	fixed crawl start: with SVN 7225 the name of the crawl start url was not given in input field and therefore all crawl starts had contained the empty string as crawl start url git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7229 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
mikeworks	70576e88d2	de.lng: Added some more untranslated strings I found and uncommented old ones that were removed terminal_p.html: Put back the old ID which was really easy to find IndexCreate.js: Because XHTML 1.0 Strict does not allow name tags for some elements rewrote most element access functions to use getElementById Table_API_p.html and all other html pages: Some XHTML 1.0 Strict fixes, changed checkAll javascript, marked the first row with checkboxes as unsortable where applicable Table_API_p.java and all other java pages: URLencoded lines with possible ampersands & -> &amp; for validation XHTML 1.0 Strict sourcecode --> All Index Create pages should validate now. Hope I did not break anything else (too much :-) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7225 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f6eebb6f99	replaced auto-dom filter with easy-to-understand Site Link-List crawler option - nobody understands the auto-dom filter without a lengthy introduction about the function of a crawler - nobody ever used the auto-dom filter other than with a crawl depth of 1 - the auto-dom filter was buggy since the filter did not survive a restart and then a search index contained waste - the function of the auto-dom filter was in fact to just load a link list from the given start url and then start separate crawls for all these urls restricted by their domain - the new Site Link-List option shows the target urls in real-time during input of the start url (like the robots check) and gives a transparent feed-back what it does before it can be used - the new option also fits into the easy site-crawl start menu git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7213 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
mikeworks	b019426811	de.lng: Added German translations for new Index Creation pages RSS Feeds and adapted text in Tables_p.html and CrawlStartExpert_p.html to match some typos, also changed one name tag to id to conform with XHTML 1.0 Strict git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7191 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	58b7417a59	- added a new 'easy' crawl start menu which can be used for the special case of loading a complete domain - the previous crawl start servlet was renamed to CrawlStartExpert_p - easy crawl start is now default git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7160 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	2f381b8d7a	- fixed at least two causes for a NPE after a use case switch. A large refactoring was necessary - added another crawl start option: automatic restriction to sub-path - removed crawlStartSimple and renamed crawl start expert to crawl start (without expert) - some changes to texts in crawl start - added some more deletions when a web index is deleted: delete also queues and robots cache git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4881 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
lulabad	fc54d4519e	some more XHTML strict errors git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4471 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
daburna	3636526bd6	replaced re-crawl/min-age as suggested here: http://forum.yacy-websuche.de/viewtopic.php?f=9&t=198 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4466 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
daburna	a047e7f830	replaced irritating "re-crawl" git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4463 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
orbiter	b183bf6f42	- fixed opensearch bugs - added 'full domain' button to expert crawl start - removed non-working 'only one domain' button, the regex allowed crawling of other domains git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4125 6c8d7289-2bf4-0310-a012-ef5d649a1542	17 years ago
low012	51800539b2	*) changed regex that is created for crawling filter (see http://forum.yacy-websuche.de/viewtopic.php?t=83 ) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3945 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	5009695537	fix for double-entries of crawl tasks. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3920 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	c7a614830a	several bugfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3899 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
allo	b2a9080a14	fix for when the user hits cancel git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3820 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
allo	b68fb8a0ba	one \ more git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3819 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
allo	e24b54301e	RegEx, not Blacklist-style RegEx ;/ git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3818 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago
orbiter	3f49cd516b	split the index create page into two pages: - one with fewer options but with information about other remote crawls - one with complete information but without any other information on both pages the steering options had been removed. They are now at the monitoring page. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@3813 6c8d7289-2bf4-0310-a012-ef5d649a1542	18 years ago

41 Commits (78e7aadb26ad38c30daa1a845b2d9cee3843c853)