yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	df3314ac1a	added a new facet type based on a probabilistic classifier using bayesian filters. This can be used to classify documents during indexing-time using a predefined bayesian filter. New wordings: - a context is a class where different categories are possible. The context name is equal to a facet name. - a category is a facet type within a facet navigation. Each context must have several categories, at least one custom name (things you want to discover) and one with the exact name "negative". To use this, you must do: - for each context, you must create a directory within DATA/CLASSIFICATION with the name of the context (the facet name) - within each context directory, you must create text files with one document each per line for every category. One of these categories MUST have the name 'negative.txt'. Then, each new document is classified to match within one of the given categories for each context.	9 years ago
reger	1d8e1e4bac	- Image search expand box, adjust javascript hs padtominsize parameter, to make sure expand box doesn't shrink on small images - ensure ImageResult.imagetext has value for the link text (use filename if no alt text given)	10 years ago
Michael Peter Christen	535f1ebe3b	added a new way of content browsing in search results: - date navigation The date is taken from the CONTENT of the documents / web pages, NOT from a date submitted in the context of metadata (i.e. http header or html head form). This makes it possible to search for documents in the future, i.e. when documents contain event descriptions for future events. The date is written to an index field which is now enabled by default. All documents are scanned for contained date mentions. To visualize the dates for a specific search results, a histogram showing the number of documents for each day is displayed. To render these histograms the morris.js library is used. Morris.js requires also raphael.js which is now also integrated in YaCy. The histogram is now also displayed in the index browser by default. To select a specific range from a search result, the following modifiers have been introduced: from:<date> to:<date> These modifiers can be used separately (i.e. only 'from' or only 'to') to describe an open interval or combined to have a closed interval. Both dates are inclusive. To select a specific single date only, use the 'to:' - modifier. The histogram shows blue and green lines; the green lines denote weekend days (Saturday and Sunday). Clicking on bars in the histogram has the following reaction: 1st click: add a from:<date> modifier for the date of the bar 2nd click: add a to:<date> modifier for the date of the bar 3rd click: remove the from: and to: modifiers and set an on:<date> modifier for the bar When the on:<date> modifier is used, the histogram shows an unlimited time period. This makes it possible to click again (4th click) which is then interpreted as a 1st click again (sets a from modifier). The display feature is NOT switched on by default; to switch it on use the /ConfigSearchPage_p.html servlet.	10 years ago
Michael Peter Christen	d9603039ff	automatically set the Q flag for smb/ftp start urls (split pdf support)	10 years ago
Ryszard Goń	3144313974	Postprocessing progress bar fix (Make it work as [probably] actually intended)	10 years ago
Michael Peter Christen	9fce8bf2a5	crawling of multi-page pdfs with artificial post part on smb or ftp shares is not possible with the disabled setting; this is now temporarily disabled until a better solution is at hand.	10 years ago
reger	b0c87d8240	fix image search expand box, cut-off of 2nd capture line height tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height) +fix charset parameter in metadataImageParser +update start errMsgTxt to "java 1.7"	10 years ago
orbiter	4177c9cf05	fix for crawl start check	11 years ago
Michael Peter Christen	362c988c05	design fixes to better use the new colours	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	11 years ago
Michael Peter Christen	a6bb9be97e	- added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates a link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser	11 years ago
Michael Peter Christen	721178dc84	misc style bugfixes	11 years ago
Michael Peter Christen	f0f22e68bb	fix for page navigation bar	11 years ago
Michael Peter Christen	deae992d47	fixes to progress bar	11 years ago
Michael Peter Christen	617dd9c97b	- added new input field in index.html - changed progress bar in yacysearch.html - moved pagination navigation to page bottom - moved search term input field to headline	11 years ago
Michael Peter Christen	ed7ad2ef0a	replaced old navbar with bootstrap pagination	11 years ago
Michael Peter Christen	1245cfeb43	small change to crawler monitor to fit in larger translations	11 years ago
Michael Peter Christen	9e0e39a9a4	small change to start/stop/pause icon style	11 years ago
orbiter	4035e20f0b	unescaping the path	11 years ago
Michael Peter Christen	81926c055d	fixed bug with image search in yacyinteractive	11 years ago
orbiter	19a051bec8	more monitoring for postprocessing and enhanced layout in Crawler monitor page	11 years ago
Michael Peter Christen	fceac8cffd	more monitoring for postprocessing	11 years ago
orbiter	9c681cc00d	added segment sizes, postprocessing status and cpu load to crawler monitor	11 years ago
Roland Haeder	ebbb3bc5c1	Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet	12 years ago
Frank	7763f2554f	add the new PPMbar in Crawler_p for a better style and better use.	12 years ago
orbiter	7ff10bdb1b	fix of page navigation for formatted totalcount numbers	12 years ago
Michael Peter Christen	c95a84103a	complete redesign of search process: - removed 'worker' processes - no internal time-out behaviour: methods either are successful or return null - waiting is only done on top-level - removed snippet-production; this is replaced by solr snippets - removed statistics based on solr size queries (they had been VERY long); the statistics (like suggestions or tag cloud) are now again based on the old but very fast RWI index. In portal or intranet mode the RWI index is usually switched off; if you like to have statistics again then you must switch on the rwis again in this mode. - fixed many bugs regarding correct page counter	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about double the current size As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	12 years ago
orbiter	594ed63f2a	fixed interactive search which caused an error if pubDate is not present in a search result	12 years ago
Michael Peter Christen	de58043205	Added image license generation for solr image search results when results are generated within yjson result writer. This makes it possible to view images in yacyinteractive from solr.	12 years ago
Michael Peter Christen	02fa31b5bf	better filesearch layout	12 years ago
Michael Peter Christen	e55ec3071d	reduced number of facets in yacyinteractive (only filetype necessary)	12 years ago
Michael Peter Christen	c34af7fe94	extended JSON Response Writer and Opensearch Response Writer for the Solr search interface in such a way that it is possible to use this interface for the yacyinteractive search. This search interface is now much faster using the Solr search directly. For the Solr interface it was necessary to create a translation from the YaCy search modifiers to the Solr facet selection. This was added in such a way that it becomes generic for the normal YaCy search and as an on-top evaluation for Solr queries.	12 years ago
Michael Peter Christen	e1f89efd0d	- made image search in interactive search using the ViewImage servlet - that enables viewing of images for intranet SMB servers. - added a filter search for protocol, tld and ext again; otherwise p2p search produces a lot of rubbish	12 years ago
Michael Peter Christen	7ad5457db0	using the solr facets as navigation in yacyinteractive.html instead of counting locally result types	12 years ago
Michael Peter Christen	b7004043ea	- added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request	12 years ago
Michael Peter Christen	86ec199126	using a better file name	12 years ago
apfelmaennchen	d31a632951	- added dmoz RDF dump importer - added indexing to Tables columns to support larger bookmark collections - added RDF output (HTTP) for public bookmarks at /YMarks.rdf - YMarkRDF also provides a Jena RDF Model as "internal" API - various other changes/fixes for YMarks (mainly backend)	12 years ago
Michael Peter Christen	6fc5400f91	added a tooltip for search navigation to mention that search pages can be navigated using the TAB key	12 years ago
sixcooler	f64e78497a	fix for reload-feature in Crawler_p	13 years ago
cominch	a120ef660b	RDF demo servlet	13 years ago
Michael Peter Christen	638390930d	another patch to fix the Crawler_p layout	13 years ago
Michael Peter Christen	c846e9ca14	redesign of the crawler monitor page: show crawled pages instead of queue of urls that shall be crawled	13 years ago
Michael Peter Christen	08dcf3e5d1	hack to get all results if the actual number is between 10 and 64	13 years ago
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	13 years ago
Michael Peter Christen	fa7b3481b3	better navigation in file search: less results by first try, but much faster. after the first search is done, buttons appear to get more results for the same search	13 years ago
Michael Peter Christen	6e51a00a2f	Revert "fix for page navigation: show only as much pages as are available for given navigation constraints, not as given by total results size" This reverts commit `73f5a9e8b3`.	13 years ago
Michael Peter Christen	73f5a9e8b3	fix for page navigation: show only as much pages as are available for given navigation constraints, not as given by total results size	13 years ago
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As an effect: - it is no longer possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	13 years ago

208 Commits (b92d81b07355b4040206d689c232dee0e8fb89ca)
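
The histogram click cycle described in commit 535f1ebe3b (1st click sets from:, 2nd click adds to:, 3rd click switches to on:, 4th click starts over) can be pictured as a small state machine over the query string. The following is a minimal Python sketch under stated assumptions, not YaCy's actual code: the function name `click_bar`, the whitespace-token handling of the query, and the ISO-style date strings are all hypothetical and chosen only for illustration.

```python
def click_bar(query: str, date: str) -> str:
    """Return the query after one click on the histogram bar for `date`.

    Hypothetical helper: modifiers are assumed to be whitespace-separated
    tokens of the form from:<date>, to:<date> and on:<date>.
    """
    tokens = query.split()

    def has(prefix: str) -> bool:
        return any(t.startswith(prefix) for t in tokens)

    def drop(*prefixes: str) -> list:
        # str.startswith accepts a tuple of prefixes
        return [t for t in tokens if not t.startswith(prefixes)]

    if has("from:") and has("to:"):
        # 3rd click: replace the closed interval by a single day
        tokens = drop("from:", "to:") + [f"on:{date}"]
    elif has("from:"):
        # 2nd click: close the interval
        tokens = tokens + [f"to:{date}"]
    else:
        # 1st click, or 4th click when on: is set: start a new interval
        tokens = drop("on:") + [f"from:{date}"]
    return " ".join(tokens)


q = "yacy"
for day in ("2015-01-10", "2015-01-12", "2015-01-11", "2015-01-09"):
    q = click_bar(q, day)
    print(q)
```

A query that carries only a to: modifier is not covered by the commit message; the sketch treats it like a first click and simply adds from:.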