yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	e039e78210	small bugfixes	10 years ago
Michael Peter Christen	32a2ff925c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	10 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	10 years ago
reger	b24572f304	fix GSA filter query assignment - use more parameter constants	10 years ago
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	11 years ago
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	11 years ago
Michael Peter Christen	dd5cdfe212	reverted filter query hack, it did not work	11 years ago
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	11 years ago
Michael Peter Christen	5326970d6c	enhanced solr queries for single document extraction	11 years ago
Michael Peter Christen	525575bd97	added debugging of filter queries in thread dump thread names	11 years ago
Michael Peter Christen	f319ef268f	testing filter queries instead of queries to retrieve documents by id	11 years ago
Michael Peter Christen	fd87fa1613	removed more unnecessary exist-checks in ErrorCache	11 years ago
Michael Peter Christen	f2b476e08b	don't do a double check to solr for failed documents if they are not written to solr	11 years ago
Michael Peter Christen	06ab72d1af	enhanced crawler host round-robin strategy	11 years ago
orbiter	dab9a0786a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	51bf5c85b0	Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all.	11 years ago
Michael Peter Christen	a694b6a8fc	another fix for unique field computation	11 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	11 years ago
reger	d9472d043a	cleanup older unused classes	11 years ago
reger	665e12f88e	move startup time from old serverCore to switchboard (most used here) to make servercore eventually obsolete.	11 years ago
reger	336425912a	remove unused localSearchThread from SearchEvent	11 years ago
reger	32bd2a61c1	add local ip to AbstractRemoteHandler local hostname cache	11 years ago
Michael Peter Christen	f3a6b6e21e	fix for bad URL decoding	11 years ago
Michael Peter Christen	1092e798a5	fixed double content postprocessing	11 years ago
Michael Peter Christen	aee5b108e5	added linkScraperParser, a parser which ignores the text like the generic parser but extracts links like the htmlParser. This should be used for ASCII documents without known text format annotation like source code files or json documents. Probably also good for xml files without known schema.	11 years ago
reger	2b8cc5832c	fix seek error for 0 file size records file by add extra check for file size = 0 in cleanlast() - (http://mantis.tokeek.de/view.php?id=411)	11 years ago
reger	2ba394333f	fix Crawler HostQueue release of stackfile - close stackfile inputstream at end of ChunkIterator This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)	11 years ago
reger	40133ba2d0	fix NPE in Condenser, discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"	11 years ago
orbiter	59160984cc	timeline performance update	11 years ago
orbiter	54bea96e67	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	841cc77391	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	e09218129c	remove check for local solr. This check was made during a time when Solr was optional and another alternative metadata store was available. Since that store is now removed, Solr is always available (internally or externally)	11 years ago
orbiter	2073e69034	fix for long periods in timeline	11 years ago
reger	1f94df29e7	fix NPE in solr rss where snippet contains only the title text and adjusted xslt, for solr snippets (&hl=true) to decode the xml encoded html <b> tag by adding disable-output-escaping (still open item description may be double as dc: tag and rss.description tag)	11 years ago
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	11 years ago
Michael Peter Christen	1cd4b2e8be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	11 years ago
reger	431a5f9c4e	added test case for TextSnippet, removed obsolete/unused parameter and reference to MediaSnippet	11 years ago
Michael Peter Christen	5b94a257ce	no timeout for large reference collections	11 years ago
Michael Peter Christen	f5b817bac4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	cb2c17d236	extract author and keywords in .doc and .ppt parser	11 years ago
reger	a5707cd2eb	enable proper Author navigator - author facet is based on omitted author_sxt field - adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?) - add check for querymodifier author in searchevent	11 years ago
Michael Peter Christen	74206a10c7	refactoring	11 years ago
orbiter	fec673c9d1	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4a66af716d	added apkParser stub (work in progress)	11 years ago
orbiter	c59da9fe7a	added access tracker log reader stub	11 years ago
reger	2d67f29244	adjust mergeDocument after parsing to - preserve charset and languages - fix merge of author	11 years ago
Michael Peter Christen	0d29b972cc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header field with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	11 years ago
Michael Peter Christen	49886fab08	enhanced debugging	11 years ago
Michael Peter Christen	b893c42a0f	bugfix for image search	11 years ago
Michael Peter Christen	c7995d3e2a	increased fixed limit for http POST request sizes to 100MB	11 years ago
reger	7847a93558	fix AbstractParser.singleList not adding null strings - prevents null titles in oo... parser (as detected by ParserTest) - correct ParserTest dc_description check (dc_description allowed to return 0 length array)	11 years ago
Michael Peter Christen	8acae852a0	write <em>-tagged texts also into the bold_txt field	11 years ago
reger	90c4576361	add a link to recrawl index entry to metadata html page - to allow manually renew index content for this url (e.g. in case it is a remote search result with metadata only) - use simply a QuickCrawlLink_p javascript snippet (minimalistic 1st solution)	11 years ago
Michael Peter Christen	2626c8f6db	using concurrency to do base64 encoding in file POST commands	11 years ago
Michael Peter Christen	e132689818	fixed and enhanced Base64 (en)coder (again)	11 years ago
Michael Peter Christen	2415e3db43	enhanced ASCII byte[] -> String conversion	11 years ago
Michael Peter Christen	4751ed974f	enhanced base64 encoding	11 years ago
Michael Peter Christen	e949071160	removed superfluous date method	11 years ago
Michael Peter Christen	501d55cd35	removed superfluous assert	11 years ago
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	11 years ago
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled e.g. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	11 years ago
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	11 years ago
Michael Peter Christen	f13c8aa7dd	re-implementation of file push option in the context of POST http requests. The internal representation of post-arguments is String and therefore not appropriate for byte[] object as submitted by file pushes. Therefore all pushed files are encoded to base64 _after_ uploading with an http form (you do not need to do that encoding yourself) to hand-over the byte[] as string in the post argument. Servlets which read such files must decode the base64 data to get the original byte[] array. This is considered as a temporary solution for file uploads and a proper implementation would need to consider all attributes as handed over as Objects with either String or byte[] Object instances. This would be a major code change and is not done at this time here now. The feature was submitted to realize a feature as pushed with the next commit.	11 years ago
Michael Peter Christen	ba6ffddefc	refactoring	11 years ago
reger	982601017e	crawling of filenames with + fails due to url decoding modified UTF8.decodeURL to apply x-www-form-urlencoded ( space -> + ) to the query part of the url only.	11 years ago
reger	3b559e7846	optimize pdfParser skip starting reader thread if all content already read	11 years ago
reger	09f73b790f	fix pdfParser not closed warning from pdfbox for encrypted pdf on exit due to missing permission to extract	11 years ago
reger	92d1604a31	Crawler hostbalancer does not delete finished queue files, use alternative delete to fight the symptom (and fix deletion of host dirs on startup) Root cause (which class holds a lock on .stack) not found. http://mantis.tokeek.de/view.php?id=404	11 years ago
Michael Peter Christen	0c324d735c	NPE fix for postprocessing without term index	11 years ago
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	11 years ago
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	11 years ago
Michael Peter Christen	e6b28f5958	removed check on protocol for double content (user request)	11 years ago
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	11 years ago
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	11 years ago
reger	8e233e2eb4	- fix typo in Message_p (defaultpath) - use more existing switchboardconstants for getproperties - replace deprecated call defaultservlet	11 years ago
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	11 years ago
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they had been not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	11 years ago
reger	ca5437dd50	fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149 local files can be crawled (intranet mode) url parsing fixed according to RFC 1738 (for unix and windows) for win like file:///c:/tmp or file://localhost/c:/tmp for linux like file:///tmp or file://localhost/tmp Host is ignored and path must be absolute	11 years ago
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	11 years ago
sixcooler	5b1c4ef191	Monitoring and limit connection-count for Jetty	11 years ago
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	11 years ago
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	11 years ago
Michael Peter Christen	2d03037965	'Last-Modified', not 'Last-modified' according to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html	11 years ago
Michael Peter Christen	3dc5fb0050	fix for operator precedence bug (cast binds stronger than bitwise AND) in peer hash hashing. This should not change anything if java casts long to int by masking with 0xFFFFFFFFL but you never know. The important thing is, that the hashCode() should not return numbers that have the same order as the hash code order because hashing of seeds is used to remove the order in some places.	11 years ago
Michael Peter Christen	6634b5b737	debug code for index distribution testing	11 years ago
orbiter	49e344e8d9	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	7705e36703	fix for latest generic warning fix	11 years ago
sixcooler	10326892a8	avoid errors from ConnectHandler, correction for #6d16fa9	11 years ago
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	11 years ago
sixcooler	830057d788	lower Segment-size (hope to get Segments of 10GB) see: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034	11 years ago
orbiter	c028ae9b09	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	11 years ago
orbiter	181784a5cb	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	0587077d06	cleanup obsolete and not used serverswitch Authentify code as auth is mostly delegated to Jetty container.	11 years ago
orbiter	c9f66be20b	move unnecessary nested else out of condition	11 years ago
orbiter	0d8072aa99	removed warnings	11 years ago
orbiter	88f4af90da	removed warnings	11 years ago
orbiter	0f425e01ca	another circle computation enhancement	11 years ago
reger	a8d162810c	Exclude = from percent-encoding in MultiProtocolURL fix http://mantis.tokeek.de/view.php?id=185 and http://mantis.tokeek.de/view.php?id=280	11 years ago
reger	024f8e9b33	fix truncated urls containing "," addressing http://mantis.tokeek.de/view.php?id=58 Exclude comma from percent-encoding in MultiProtocolURL (see RFC 1738 2.2 and RFC 3986 2.2)	11 years ago
Michael Peter Christen	9112f0a2df	enhanced circle tool initialization	11 years ago
Michael Peter Christen	a1ac4c3b76	automatically clear graphics cache	11 years ago
Michael Peter Christen	505f58c79c	enhanced circle computation time and memory footprint	11 years ago
reger	cd8c0dbda9	assign serialVersionUID for proxyservlet, too.	11 years ago
reger	b300d7f4ce	set serialVersionUID on urlproxyservlet to skip compiler warning - remove commented out code	11 years ago
reger	e9060d31bd	update to Jetty 9 besides adjustments in code it makes the servlet settings in web.xml significant. This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).	11 years ago
reger	1432a817dd	respect "index media" switched off in CrawlStartExpert.html fix http://mantis.tokeek.de/view.php?id=64	11 years ago
orbiter	39e1913585	next development step: migration to java 1.7 This includes also a small code change to test generic type inference, a java 1.7 feature	11 years ago
Michael Peter Christen	4e734815e8	enhanced snippets: remove lines which are identical to the title and choose longer versions if possible. Prefer the description part.	11 years ago
Michael Peter Christen	e84e07399a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	89f76da24b	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
sixcooler	390f03e041	do not check for segments-count on optimize: this is also done in Solr and our getSegmentsCount() does not return up-to-date values	11 years ago
reger	8a7c68e4c7	content of surrogates/out never accessed (remove) After import the content is never accessed but may take up a lot of disk space, also the getLoadedOAIServer (which lists the files in surrogate out) is not used. Making the surrogate.out obsolete. Removed keeping of xmls after import.	11 years ago
sixcooler	b8cee9b7d8	remove tables from tabletracker on close to avoid lots of dead entries in /PerformanceMemory_p.html	11 years ago
reger	1600414450	fix NPE on continuing crawls after YaCy restart (Agent is then null)	11 years ago
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	11 years ago
orbiter	f15c832587	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Marc Nause	c97da1a0d8	First draft of a blacklist API.	11 years ago
Michael Peter Christen	d4f65833a1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	c1c1be8f02	fix for slow crawling and better logging in balancer	11 years ago
Michael Peter Christen	3acf416335	npe fix	11 years ago
reger	2eb7682772	add html5 audio/video <source> tag to html content scraper - <source src=.. type=..> tag content is added to embed collection	11 years ago
reger	0b6db04e40	fix contentscraper img height/width parsing prevent numberformat exception on common "100px" property - include in test case	11 years ago
reger	ffc5b75c73	optimize and fix lat / lon assignment	11 years ago
reger	9313447de2	reimplement tighter lat/lon calc in URIMetadataNode from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272	11 years ago
reger	d812f80784	add exit proxy link to UrlProxy on proxied pages a link to exit proxy is added to top of page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.	11 years ago
reger	78d08998db	throw MalformedURLException on unknown protocol on other than the supported http https ftp file smb \\ mailto	11 years ago
reger	bb8181b2be	fix: resolve url without path but searchpart e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/" fixes http://mantis.tokeek.de/view.php?id=47 added test case for getHost	11 years ago
orbiter	a3542f29b4	npe fix	11 years ago
orbiter	c48d2a2a02	npe fix	11 years ago
reger	121d25be38	recover sax fatal error on OAI-PMH import of xml with entity error this allows to continue loading next resumptionToken even if import file caused sax parser error fix http://mantis.tokeek.de/view.php?id=63	11 years ago
reger	81dc2aa536	add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links)	11 years ago
orbiter	2fd8a0ead6	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	8e5ce7cd51	fixed a situation where finished crawls had not been detected.	11 years ago
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	11 years ago
orbiter	0c88a32c36	do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions.	11 years ago
orbiter	8e04030596	in case of short memory, do not cut down robinson peers to 1, just reduce by 50%	11 years ago
reger	86f6975edc	exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case	11 years ago
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	11 years ago
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	11 years ago
orbiter	12ba890205	removed warnings	11 years ago
reger	d51f9cc863	add custom Jetty errorhandler to provide custom error page footer line - remove redundant mime check in UrlProxyServlet	11 years ago
reger	c193a02023	defer creation of new ArrayList after possible early return (to skip not used object allocation)	11 years ago
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	11 years ago

7317 Commits (b93ea4e2a6e23af37f72440add98e0129e9e03dc)