yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Roland Haeder	ebbb3bc5c1	Fixed CHMOD on many files + added missing loggers (e.g. jena) and made some noisy loggers quiet	12 years ago
Michael Peter Christen	bcc623a843	refactoring of load_delay: this is a matter of client identification	12 years ago
orbiter	2be456e7fb	added a postprocessing field into api/status_p.xml to show if the postprocessing task is running at that time (status: busy) or not (status:idle)	12 years ago
orbiter	c4efb612e2	added list of crawls to status_p.xml	12 years ago
orbiter	dac88561ae	minimum access time has a tight connection to ClientIdentification, therefore it is defined there.	12 years ago
Michael Peter Christen	5878c1d599	- refactoring of log to ConcurrentLog: jdk-based loggers tend to block at java.util.logging.Logger.log(Logger.java:476) in concurrent environments. This makes logging a main performance issue. To overcome this problem, this is an add-on to jdk logging to put log entries on a concurrent message queue and log the messages one by one using a separate process. - FTPClient uses the concurrent logging instead of the log4j logger	12 years ago
orbiter	c8e94ad7c7	fix for citation search in case that the citation is very fresh	12 years ago
Michael Peter Christen	fd1776a3b0	added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source.	12 years ago
Michael Peter Christen	8f2d3ce2f9	reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit	12 years ago
Michael Peter Christen	038f956821	fix for sitemap detection: the sitemap url was not visible if it appeared after the declaration of robots allow/deny for the crawler because the sitemap parser terminated after the allow/deny rules had been found. Now the parser reads the robots.txt until the end to discover also sitemap rules at the end of the file.	12 years ago
Michael Peter Christen	008288719c	fix for schema export to consider also automatically generated coordinate fields	12 years ago
Michael Peter Christen	58e1e6fa2b	fixes to schema	12 years ago
Michael Peter Christen	788288eb9e	added the generation of 50 (!!) new solr fields in the core 'webgraph'. The default schema uses only some of them and the resulting search index has now the following properties: - webgraph size will have about 40 times as many entries as default index - the complete index size will increase and may be about double the current size As testing showed, not much indexing performance is lost. The default index will be smaller (moved fields out of it); thus searching can be faster. The new index will cause that some old parts in YaCy can be removed, i.e. specialized webgraph data and the noload crawler. The new index will make it possible to: - search within link texts of linked but not indexed documents (about 20 times of document index in size!!) - get a very detailed link graph - enhance ranking using a complete link graph To get the full access to the new index, the API to solr has now two access points: one with attribute core=collection1 for the default search index and core=webgraph to the new webgraph search index. This is also available for p2p operation but client access is not yet implemented.	12 years ago
Michael Peter Christen	91a0401d59	introduced a second core named 'webgraph'. This core will hold the link structure, but is not filled yet. To have the opportunity of a second core, multi-core functionality had to be implemented to the deep-embedded solr: - migrated the solr_40 directory content to a subdirectory 'collection1'; the previously used default core is now called collection1 - added solr_40/webgraph subdirectory as second core - added a servlet configuration for the second core 'webgraph' in /IndexSchema_p.html - added instance handling as addition to solr connections: all solr connectors are now instances of a solr 'instance' object; this required a complete re-design of the solr embedding - migrated also caching and sharding on top of new instance handling - migrated the search apis to handle now the access to a specific core, the default core named 'collection1' - migrated the remote solr search interface to access shards of cores; for the yacy remote search the default core is now called 'solr'; using the peer address as solr address - migrated the solr backup and restore process: old backups cannot be used after this migration! - redesign of solr instance handling in all methods which access the instances: they cannot hold copies of these instances any more; they must retrieve the actual connection object every time they want to write to it (this solves also some bugs when switching the index/network) - added another schema 'solr.webgraph.schema', the old solr.keys.list is replaced by solr.collection.schema	12 years ago
Michael Peter Christen	b6de1f42dc	Full redesign of solr connection architecture. This was done to support multiple solr cores instead of just one. Therefore it is now necessary to distinguish between solr server connections (called an 'Instance') and a connection to a single solr core. One Instance may now have multiple connector classes assigned to it, each connecting to a single core. To support multiple cores it is also necessary to distinguish between the connection configuration and the configuration of the index schema. We will have multiple schema configurations in the future, each for every solr core. This caused that the IndexFederated servlet had to be split into two parts, the new Servlet for the Schema editor is now in the IndexSchema Servlet.	12 years ago
Michael Peter Christen	dee8b24d3c	better error handling for bookmarks	12 years ago
Michael Peter Christen	3834829b37	bugfixes and more logging for solr connector	12 years ago
Michael Peter Christen	99185d7048	one more fix for author_sxt	12 years ago
Michael Peter Christen	b6ae6262f6	- add the copyField author_sxt only if author exists - set the solr default search field according to existing fields	12 years ago
Michael Peter Christen	e23a596c1d	added a copyField for author_sxt for automated schema generation	12 years ago
Michael Peter Christen	244b157299	fix for external solr schema definition	12 years ago
reger	f301336adf	fix: no results with configuration citation reference index switched off - urlcitationindex != null check added to ResultEntry.referencesCount - plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)	12 years ago
Michael Peter Christen	cb5cbec14d	distinguishing modified query string and original query string	12 years ago
Michael Peter Christen	3de784c8dd	replaced more split and replaceAll missing pattern pre-compilation with pre-compiled pattern	12 years ago
Michael Peter Christen	8fc3679c66	using more pre-compile pattern for split methods	12 years ago
Michael Peter Christen	4eab3aae60	removed overhead by preventing generation of full search results when only the url is requested	12 years ago
Michael Peter Christen	952e143580	FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!!	12 years ago
orbiter	5dfd6359cb	redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal.	12 years ago
Michael Peter Christen	5fd3b93661	added deletion of hosts during crawl start if deleteold option was given	12 years ago
Michael Peter Christen	d64445c3cb	because we have the inurl:<term> - searchmodifier, we don't actually need regular expressions as search attributes. They have now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as regular expression filter query. If the expression points out a selection of a specific protocol, host or filetype this is then translated into a faceted query.	12 years ago
Michael Peter Christen	2d9e577ad0	replaced the custom robots.txt loader by the standard http loader	12 years ago
Michael Peter Christen	ccc3760a47	Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document as the core of YaCy documents which would cause that many conversions can be removed. On the way to this target the Equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower.	12 years ago
Michael Peter Christen	43f3345c90	- removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code	12 years ago
Michael Peter Christen	21fe8339b4	- enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures	12 years ago
Michael Peter Christen	5f0ab25382	removed the option to prevent removal of & parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing.	12 years ago
Michael Peter Christen	abab291162	made the index schema retrieval public and allow cross-domain retrieval	12 years ago
Michael Peter Christen	1533bfd63b	refactoring	12 years ago
Michael Peter Christen	872f83ebe0	refactoring	12 years ago
Michael Peter Christen	8219a445f3	refactoring	12 years ago
Michael Peter Christen	00c1c777fa	refactoring	12 years ago
orbiter	563d584420	removed more dependencies in cora from kelondro	12 years ago
orbiter	63762d8f89	removed kelondro dependencies from cora	12 years ago
Michael Peter Christen	b69ed96f0b	- added collections to yacydoc - changed yacydoc.htm to yacydoc.json - added query logging in solr and gsa search result	12 years ago
Michael Peter Christen	4d29f59a27	removed warnings	12 years ago
Michael Peter Christen	8c099d2106	Merge remote-tracking branch 'origin/master' Conflicts: htroot/api/ymarks/import_ymark.java source/de/anomic/data/ymark/YMarkEntry.java source/de/anomic/data/ymark/YMarkTables.java	12 years ago
apfelmaennchen	d31a632951	- added dmoz RDF dump importer - added indexing to Tables columns to support larger bookmark collections - added RDF output (HTTP) for public bookmarks at /YMarks.rdf - YMarkRDF also provides a Jena RDF Model as "internal" API - various other changes/fixes for YMarks (mainly backend)	12 years ago
Michael Peter Christen	8ca842b137	added new button design to more buttons	12 years ago
Michael Peter Christen	b2b516cc3e	added a collection attribute to crawls and searches: - a solr field collection_sxt can be used to store a set of crawl tags - when this field is activated, a crawl tag can be assigned when crawls are started - the content of the collection field can be comma-separated, all of them are assigned to the documents when they are indexed as result of such a crawl start - a search result can be drilled down to a specific collection; this is currently only available in the solr interface and also in the gsa interface using the 'site' option - this adds a mandatory field for gsa queries (the google api demands that field all the time)	12 years ago
Michael Peter Christen	a427a68bac	removed many warnings	12 years ago
Michael Peter Christen	31d4d38804	- extended the solr interface by a references-by-word-count method - reduced danger that a non-existing RWI database causes NPEs - added Solr queries to did-you-mean: this makes it possible that our did-you-mean algorithm works together with only Solr and without RWIs	12 years ago
Michael Peter Christen	528d6763fa	- added new solr fields: title_count_i, title_chars_val, title_words_val description_count_i, description_chars_val, description_words_val - added many asserts to ensure data type correctness from YaCy to Solr and vice versa - made many fixes according to new findings from these asserts (!)	12 years ago
Michael Peter Christen	75d5e3475d	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	12 years ago
Michael Peter Christen	316b5fe116	- added a solr type definition verifier - fixed type definition found by the verifier - added multivalue-string fields for solr with extension 'sxt' - added multivalue-integer fields for solr with extension 'val' - renamed some solr attributes from txt to sxt - changed solr query line to an explicit AND/OR structure - added a country code second level domain list to Domains class; with parser - added a host string parser to get domain class name, country-code second-level domain and subdomain out of it - removed old coordinate attributes	12 years ago
reger	2d2be546fe	fix path to env/grafics to display api icon on meta data page	12 years ago
Michael Peter Christen	0cab06c47c	refactoring	12 years ago
Michael Peter Christen	06a78eecb7	code simplification	12 years ago
Michael Peter Christen	18f989dfb1	- refactoring (load -> getMetadata) - added getDocument to retrieve Solr documents which shall replace getMetadata	12 years ago
Michael Peter Christen	136fcb1ad9	refactoring	12 years ago
Michael Peter Christen	24d9db1613	snippet retrieval loading processes may use a smaller minimum load time value than crawling processes. This speeds up the search result preparation dramatically.	12 years ago
Michael Peter Christen	1687737771	Abstraction of HandleMap and HandleSet	12 years ago
Michael Peter Christen	6f1ddb2519	Moved solr index-add method to the same method where the YaCy index is written. Also done some code-cleanup.	12 years ago
orbiter	69e743d9e3	- more abstraction for the RWI index as preparation for solr integration - added options in search index to switch parts of the index on or off	12 years ago
Michael Peter Christen	f78ce93a80	collection of speed and memory saving hacks	13 years ago
orbiter	0cbda0b2b8	- replaced all length() == 0 and size() == 0 with isEmpty() - replaced some length() > 0 and size() > 0 with !isEmpty() - cannot be done automatically - implemented some isEmpty() methods	13 years ago
Michael Peter Christen	b0c408788b	made class methods static where possible	13 years ago
Michael Peter Christen	5bd3c90907	- removed unnecessary semicolons - added default case for switch	13 years ago
Michael Peter Christen	241dd8410a	removed snippet pattern filter - it was not used	13 years ago
Michael Peter Christen	d3964253ae	- added @SuppressWarnings to unused servlet method parameters - removed unnecessary casts - removed unnecessary throw statements	13 years ago
Michael Peter Christen	ea10766bfd	cleaned unnecessary nested code	13 years ago
Michael Peter Christen	1825f165b8	better integration of blacklist according to use case	13 years ago
Michael Peter Christen	03280fb161	removed segments-concept and the Segments class: the segments had been there to create a tenant-infrastructure but were never used since that was all much too complex. There will be a replacement using a solr navigation using a segment field in the search index.	13 years ago
Michael Peter Christen	9116013c64	- allow lazy initialization of solr value (if using 'lazy', then no 0-values and no empty strings are written). This may save a lot of memory (in ram and on disc) if excessive 0-values or empty strings appear) - do not allow default boolean values for checkboxes because that does not make sense: browsers may omit the checkbox attribute name if the box is not checked. A default value 'true' would not comply with the semantic of the browsers response. - add a checkbox in IndexFederated_p for the lazy initialization of solr fields.	13 years ago
cominch	011f8a5818	Auto Tagging: Add hyperlinks to tags (provisional)	13 years ago
Michael Peter Christen	52f5d40043	better abstraction of document model generation	13 years ago
Michael Peter Christen	8b7c4d3144	produce a rdf output containing the triplestore with yacydoc; ie: http://localhost:8090/api/yacydoc.rdf?urlhash=yOiCM7Fh1hyQ	13 years ago
cominch	d8815db877	Merge remote-tracking branch 'original yacy/master'	13 years ago
cominch	e4dab19045	Augmented Browsing: added template for document info bar	13 years ago
Michael Peter Christen	b2d1c25ebb	removed warnings/unused entities	13 years ago
Michael Peter Christen	64c0268b2b	show triplestore metadata in yacydoc and viewfile	13 years ago
Roland 'Quix0r' Haeder	edaa09b9b1	Rewrote all String blacklist types to enum 'BlacklistType', closes bug #143 Conflicts: htroot/Supporter.java htroot/yacy/crawlReceipt.java htroot/yacy/transferRWI.java htroot/yacy/transferURL.java source/de/anomic/crawler/CrawlStacker.java source/de/anomic/data/ListManager.java source/net/yacy/peers/Protocol.java source/net/yacy/repository/Blacklist.java source/net/yacy/repository/LoaderDispatcher.java source/net/yacy/search/Switchboard.java source/net/yacy/search/index/MetadataRepository.java source/net/yacy/search/index/Segment.java source/net/yacy/search/query/RWIProcess.java source/net/yacy/search/snippet/MediaSnippet.java	13 years ago
cominch	87a3fbb3c2	interaction javascript	13 years ago
Michael Peter Christen	8b974905ee	changed log-in text for all servlets with authentication: - added hint how to set the password using a shell script - added a shell script to change the password	13 years ago
reger	b2175ea4ef	Add possibility to set custom Solr field names for the YaCy default Solr attributes. - Changing the format of YaCy's solr.key.list while maintaining backward compatibility Federated index config screens adjusted accordingly - modified the Solr update request to use a 3 min Solr autocommit interval	13 years ago
Michael Peter Christen	c00efc2717	made the solr connection more generic	13 years ago
Michael Peter Christen	453010bd68	- solved problems with backpath normalization - redesigned in/outbound link handover - removed iframe links from inbound/outbound in solr scheme	13 years ago
Michael Peter Christen	0e13022147	- enhanced solr field documentation - added xml api button to IndexFederated_p - the solr schema.xml file can be generated by YaCy	13 years ago
Michael Peter Christen	e377092198	fix to xml output format	13 years ago
Michael Christen	41be98dc9d	extended webstructure api to show together with incoming links also outgoing links	13 years ago
Michael Christen	8f89c8ef07	added information about inbound, outbound and citation links into yacydoc api servlet	13 years ago
Michael Christen	71649a1296	added an api to retrieve the new citation.index with the webstructure.xml api. This api will respond with details about a single URL if requested with 'webstructure.xml?about=[url|urlhash|host]'.	13 years ago
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As an effect: - it is no longer possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	13 years ago
Michael Peter Christen	e2f8f263e8	changed storage of search words: keep order	13 years ago
Michael Peter Christen	c166eb68b6	fixes in solr schema file	13 years ago
Lotus	335a776351	xss hardening on Status.html	13 years ago
Michael Peter Christen	ef5192f8c9	using the generic document parser for crawl starts instead of the html parser. This makes it possible that every type of document can be a crawl start point, not only text documents or html documents. Tested this with a pdf document.	13 years ago
Michael Peter Christen	ce620be783	fix for crawl start with smb url	13 years ago
Michael Peter Christen	7053f8ab46	added automatic generation of a solr schema.xml file	13 years ago
Michael Peter Christen	2ee8cbeb2c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/search/Switchboard.java	13 years ago
Michael Peter Christen	992dbdf4bb	added noload statistic to servlets	13 years ago
Roland 'Quix0r' Haeder	fa08ed5ae5	Fixed a lot of CHMOD rights (no need for execute flag on .java/.html) and introduced local/remote crawl size ratio based check	13 years ago

367 Commits (a1ac4c3b76ab0ea01c7a2f2e52721d94bff01717)