yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	8c3e5b7b6d	added experimental pdf splitting which enables YaCy to split pdfs during parsing into individual pages and add them all using different URLs. These constructed urls are generated from the source url with an appended page=<pagenumber> attribute to the url get/post properties. This will distinguish the different page entries. The search result list will then replace the post parameter with a url anchor # mark, so that the original url is presented in the search result. These URLs can be opened directly on the correct page using pdf.js which is now built-in into firefox. That means: if you find a search hit on page 5 and click on the search result, firefox will open the pdf viewer and show page 5.	10 years ago
Michael Peter Christen	d14114697c	the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible.	10 years ago
reger	deb75a1dbe	fix refactored size() -> filesize() in YMarkMetadata	10 years ago
reger	198102304b	refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size())	10 years ago
reger	c6f634a4f2	remove redundant caching of urlhash in URIMetadataNode (is already cached in underlying DigestURL .url) upd pom keyword for maven-antrun-plugin	10 years ago
Michael Peter Christen	5516819354	preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again.	10 years ago
Michael Peter Christen	d3e71ed070	fixes for searches when initialization of large autotagging libraries has not been finished	10 years ago
Michael Peter Christen	28683530cd	fixes to usage of no-cache: use and recognize also the no-store directive	10 years ago
Michael Peter Christen	c9c700b510	reduction of http requests to YaCy using the correct cache-control, expires and last-modified headers in http response.	10 years ago
reger	13cca2b114	fix missing AppPath upd Maven plugin versionid	10 years ago
Michael Peter Christen	65125439fe	added query modifier 'on'. This makes it possible to search for date occurrences within the (web) page documents (not the document last-modified!). This works only if the solr field dates_in_content_sxt is enabled. A search request may then have the form "term on:<date>", like gift on:24.12.2014 gift on:2014/12/24 * on:2014/12/31 For the date format you may use any kind of human-readable date representation(!yes!) - the on:<date> parser tries to identify language and also knows event names, like: bunny on:eastern .. as long as the date term has no spaces inside (use a dot). Further enhancement will be made to accept also strings encapsulated with quotes.	10 years ago
Michael Peter Christen	1cfddea578	added (very experimental) Solr response writer for snapshot image results	10 years ago
Michael Peter Christen	7287dd764e	added url, date, time and page number on pdf snapshot footer	10 years ago
Michael Peter Christen	8b5d074715	fix for image parser (there is a class missing!)	10 years ago
Michael Peter Christen	932faafffe	reactivated on-demand snapshot loading	10 years ago
Michael Peter Christen	2362ad7c34	fix for a count issue in snapshot api	10 years ago
Michael Peter Christen	3354cd63be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	9971e197e0	Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods have been added to check the success of transactions. The transactions for snapshots have two main components: a rss search API to get information about latest/oldest entries and a commit/rollback API to move entries away from the rss results. This is done by usage of two storage locations for the snapshots, INVENTORY and ARCHIVE. New snapshots are placed to INVENTORY, committed snapshots move to ARCHIVE, rollback snapshots move to INVENTORY again. Normal Workflow: Besides all the options below, usually it is sufficient to process data like this: - call http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST - process the rss result and use the <guid> value as <urlhash> (see next command) - for each processed result call http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> - then you can call the rss feed again and the committed urls are omitted from the next set of items. These are the commands to control this: The rss feed: http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=LATESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=OLDESTFIRST http://localhost:8090/api/snapshot.rss?state=ARCHIVE&order=ANY The feed will return a <urlhash> in the <guid> field of the rss. This must be used for commit/rollback: Commit/Rollback: http://localhost:8090/api/snapshot.json?command=commit&urlhash=<urlhash> http://localhost:8090/api/snapshot.json?command=rollback&urlhash=<urlhash> The json will return a property list containing the property "result" with possible values "success" or "fail", according to the result. If a "fail" occurs, please look into the log for further info. Monitoring: http://localhost:8090/api/snapshot.json?command=status This shows the total number of entries in the INVENTORY and the ARCHIVE http://localhost:8090/api/snapshot.json?command=list This will return a list of all hosts which have snapshots and the number of entries for the hosts. Counts for INVENTORY and ARCHIVE are listed in the properties for "count.INVENTORY" and "count.ARCHIVE" http://localhost:8090/api/snapshot.json?command=list&depth=2 The list can be restricted to such which have a specific depth. The list contains then the same host names, but the count values change because only documents at that specific crawl depth are listed http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80 This lists all urlhashes for the given host, not only an accumulated list of the number of entries http://localhost:8090/api/snapshot.json?command=list&host=yacy.net.80&depth=0 This restricts the list of urlhashes for that host for the given depth http://localhost:8090/api/snapshot.json?command=list&state=INVENTORY http://localhost:8090/api/snapshot.json?command=list&state=ARCHIVE This selects either the INVENTORY or ARCHIVE for all list commands, default is ALL which means that from both snapshot directories the host information is collected and combined. You can use the state option for all the commands as listed above Detailed Information: http://localhost:8090/api/snapshot.json?command=metadata&urlhash=upiFJ7Fh1hyQ This collects metadata information for the given urlhash. This can also be restricted with state=INVENTORY and state=ARCHIVE to test if the document is in either one of these snapshot directories. If a urlhash is not found, an empty result is returned. If an entry was found and the state was not restricted, then the result contains a state property containing the name of the location where the document is, either INVENTORY or ARCHIVE. Hint: If a very large number of documents is inside of INVENTORY, then it could be better to call the rss feed with http://localhost:8090/api/snapshot.rss?state=INVENTORY&order=ANY because that is very efficient.	10 years ago
reger	63846ddb89	add final SolrQueryRequest.close to SolrServlet	10 years ago
reger	9edc7308aa	update to metadata-extractor-2.7.0.jar add 2 simple JUnit test cases for jpeg and tif parsing	10 years ago
Michael Peter Christen	578ae29f1e	added a note that the servlet is linked using web.xml	10 years ago
reger	6c3f36def1	- fix path to default heuristic.cfg - deprecate unused ProxyServlet	10 years ago
Michael Peter Christen	bbf0ac40c3	add the actual DateDetection class... (missed in latest commit)	10 years ago
Michael Peter Christen	66b5a56976	Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. This process is therefore able to identify a large set of dates to a document, either because there are several dates given in the document or the date is ambiguous. Four new Solr fields are used to store the parsing result: dates_in_content_sxt: if date expressions can be found in the content, these dates are listed here in order of the appearances dates_in_content_count_i: the number of entries in dates_in_content_sxt date_in_content_min_dt: if dates_in_content_sxt is filled, this contains the oldest date from the list of available dates #date_in_content_max_dt: if dates_in_content_sxt is filled, this contains the youngest date from the list of available dates, that may also be possibly in the future These fields are deactivated by default because the evaluation of regular expressions to detect the date is yet too CPU intensive. Maybe future enhancements will cause that this is switched on by default. The purpose of these fields is the creation of calendar-like search facets, to be implemented next.	10 years ago
Michael Peter Christen	c3c2b6999b	fixes on wkhtmltopdf	10 years ago
Michael Peter Christen	114f0afc1e	enable sku as anchor in html response writer	10 years ago
Michael Peter Christen	aa80cb1159	enhanced tagging preparation speed which reduces initialization time for very large vocabularies	10 years ago
Michael Peter Christen	6a1865f507	refactoring date -> lastModified	10 years ago
Michael Peter Christen	ab6cc3c88c	added concurrent generation of snapshot pdfs	10 years ago
Michael Peter Christen	413eeefed4	added character set detection library from http://www-archive.mozilla.org/projects/intl/chardet.html	10 years ago
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which causes that the usage of the vocabulary for search facets is enabled or disabled. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundling terms for such synonyms	10 years ago
Michael Peter Christen	87b53b3572	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above of the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	5d67e165d9	remove redundant null check in ResponseHeader.lastModified added a JUnit testcase for ResponseHeader dates (using age()), adjusted age() to pass all tests	10 years ago
reger	5f0bb1214f	modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high.	10 years ago
reger	e52370728a	fix startup stop on missing HTCACHE/SNAPSHOT directory	10 years ago
reger	e5236aa7ca	Merge origin/master	10 years ago
reger	70cf7060a4	coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510	10 years ago
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	10 years ago
Michael Peter Christen	8b522687e0	added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class	10 years ago
reger	568c991405	remove the unused Request variable (fix of prev. commit)	10 years ago
reger	d6539ba597	Merge origin/master	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve this, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	a304058840	added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work...	10 years ago
Michael Peter Christen	d83de9ecf5	added another path for the convert command because on older Macs ImageMagick has a different installation location	10 years ago
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This supports also an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. For detailed instructions, see `97f6089a41`	10 years ago
reger	28456dfc09	skip creation of unused Bluelist contenttransformer	10 years ago
Michael Peter Christen	321840fde3	Replaced all fixed thread pools with cached thread pools. The cached thread pools will flush their cached (dead) threads after 60 seconds. This will cause that YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, which caused that the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the cached number of threads. This caused that VM supervisors terminated whole VM sessions if a limit was reached. Many VM providers have limits of numproc=96 which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts.	10 years ago
Michael Peter Christen	7bfab5eb9d	set Busy- and Blocking-Threads to daemon mode (they will now not prevent YaCy from termination if still running)	10 years ago
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	10 years ago
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	10 years ago
Michael Peter Christen	a1ee101079	recognize more html file extensions	10 years ago
Michael Peter Christen	8480641f2d	fix to xvfb-run usage (quotes did not parse in xvfb-run, default values are appropriate)	10 years ago
Michael Peter Christen	68b040e31e	added fail-over missing http proxy service (i.e. overload) and quiet mode	10 years ago
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler to prevent that existing cache entries cause that the handler is not executed	10 years ago
Michael Peter Christen	c35170a305	more logging	10 years ago
Michael Peter Christen	e8be07ec78	grr	10 years ago
Michael Peter Christen	6f81bb756c	wrap wkhtmltopdf with xvfb if necessary	10 years ago
Michael Peter Christen	0119f8665d	more logging when failing to create pdf snapshot	10 years ago
Michael Peter Christen	416fe886e3	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	10 years ago
Michael Peter Christen	97f6089a41	YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must add wkhtmltopdf and imagemagick to your OS, as follows: On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from http://wkhtmltopdf.org/downloads.html and download http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip In Debian do "apt-get install wkhtmltopdf imagemagick" Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages using the YaCy proxy. Using "Always Fresh" it is possible to get all pages from the proxy cache. Finally, you will see a new option when starting an expert web crawl. You can set a maximum depth for crawling which should cause a pdf generation. The resulting pdfs are then available in DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf	10 years ago
reger	ff80700aff	replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal	10 years ago
Michael Peter Christen	9ea120dbe5	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	0c97cc2440	skip unused call parameter for hashSentence()	10 years ago
reger	5790c7242e	skip to tokenize punctuation as word in WordTokenizer remove unused variables in condenser related to Tokenizer	10 years ago
reger	f07392ff17	add. use host port parameter in YaCyApp	10 years ago
Michael Peter Christen	09d2867050	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	ad0da5f246	added new web page snapshot infrastructure which will lead to the ability to have web page previews in the search results. (This is a stub, no function available with this yet...)	10 years ago
Michael Peter Christen	5f5c7d69d1	added image screenshot generator	10 years ago
Michael Peter Christen	1d45d9405a	security bugfix	10 years ago
Michael Peter Christen	ff728b4aa5	ignore url errors during search	10 years ago
Michael Peter Christen	8317914ce3	changed vocabulary navigator object type to TreeMap to get a specific order into the vocabularies. This is now lexicographic which is not so much random as a hashed order	10 years ago
Michael Peter Christen	d5c1b07768	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	c0f9f6ac66	added option to change the navbar-default, i.e. usable for dark skins	10 years ago
Michael Peter Christen	10794e8efd	trying facet.method fc instead of fcs to handle large facets	10 years ago
Michael Peter Christen	041b605cfe	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
Michael Peter Christen	f1f74e8626	toString fix	10 years ago
Michael Peter Christen	30276a2b48	prevent that a local Solr search and a local RWI search are running concurrently. When a RWI search result is flushed into the result set, it does Solr Queries (which replaced the old-style Metadata Queries) and they are possibly running concurrently to a previously started Solr search. Both methods may block each other with IO. To enhance the speed, they are now serialized. Because the Solr search results may result in better results using the more advanced and configurable Ranking methods, this result is preferred over the RWI search result. However, remote RWI search results are still fed concurrently into the search result as well.	10 years ago
Michael Peter Christen	84763126e0	added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh' the cache is always used if the entry in the cache exists. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false by default, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet!	10 years ago
reger	1e7ee72240	fix path lookup to ./defaults/yacy.badwords (fix of commit `ee277b9b3e`)	10 years ago
reger	7d863d6254	fix empty text facet entry (noticed on Author facet)	10 years ago
Michael Peter Christen	a39419f2ef	more stacks shall be considered for on-demand loading, not only deep-depth stacks to prevent "too many open files" problem	10 years ago
Michael Peter Christen	5bb52f79be	reduce number of calls to queue.size() because that may be a bottleneck during crawling	10 years ago
Michael Peter Christen	4920ab7b76	optimize usage of size() cache	10 years ago
reger	ee277b9b3e	allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/) if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default) move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory	10 years ago
reger	de56266bcb	remove redundant toLower for topwords	10 years ago
Michael Peter Christen	a34f837592	better delete all files in path when removing host crawl stack	10 years ago
Michael Peter Christen	10b1db430a	if we have many hosts, use on-demand earlier	10 years ago
Michael Peter Christen	1324927e66	prevent division by zero	10 years ago
Michael Peter Christen	2beb6abeb6	disabled crazy sleep loop	10 years ago
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	10 years ago
Michael Peter Christen	a0b84e4def	use a LinkedHashMap for facets to maintain facet order as given by solr	10 years ago
reger	ef5dc68313	include domtype to searcheventcache id to differentiate between local / global events for reuse of cached events fix for http://mantis.tokeek.de/view.php?id=493	10 years ago
Michael Peter Christen	0dc6e0a5f2	added option to enrich vocabularies with synonyms from synonym database	10 years ago
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	10 years ago
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	10 years ago
Michael Peter Christen	a67a465415	fix field counter for multi-fields in html writer for the solr servlet	10 years ago
Michael Peter Christen	ec9d021568	added option in vocabulary editor to import CSV files with different encodings (preselected windows-type character encoding which is typical for CSV files). Fixed also other problems with character encoding in dictionary files. Automatically generated vocabularies are now also noted in the API steering.	10 years ago
reger	3c818fc912	add a check of java version string >=1.7 to startup class stopping start with error msg on version < 1.7	10 years ago
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	10 years ago
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This allows also to make the execution in dependency of available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	10 years ago
Michael Peter Christen	8aee7f940e	added missing class for latest changes	10 years ago
Michael Peter Christen	97039049e4	fix in key enumeration methods for cases where the enumeration is done in reverse order.	10 years ago
Michael Peter Christen	7e1b0b6712	fix for wildcard patch in search queries	10 years ago
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	10 years ago
Michael Peter Christen	421ee64f33	another fix to ordering of table indexes; fixes also network stats graphics	10 years ago
Michael Peter Christen	1db476c67e	fix for bad table iteration	10 years ago
reger	e4316e2d74	skip creation of local var in proxyhandler.storetocache	10 years ago
sixcooler	9c6e3a6b1c	fix assertion failure in version string for Solr-4.10.2 by changing the assert - hope that is ok + add forgotten NB project changes	10 years ago
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	10 years ago
Michael Peter Christen	5c97ecb30f	fix of bad query generation for search facets	10 years ago
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	10 years ago
orbiter	72c2bc5189	fix for search in case where local peer has no local seed address in portal mode	10 years ago
orbiter	5be352da99	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	0fcd8097a3	removed unused options from BusyThreads	10 years ago
Michael Peter Christen	fe8b1d137d	emergency bugfix for 100% CPU in image drawing	10 years ago
Michael Peter Christen	92007e5d2d	more enhancements to postprocessing speed	10 years ago
Michael Peter Christen	9a7fe9e0d1	fix for bad timing computation in postprocessing	10 years ago
Michael Peter Christen	bd16119a00	another fix for postprocessing (the query for "" on numeric field did not work in external solr)	10 years ago
Michael Peter Christen	327e83bfe7	more fixes in postprocessing: partitioning of the complete queue to enable smaller queries	10 years ago
orbiter	2bc6199408	more concurrency for postprocessing	10 years ago
orbiter	a83cf26c38	more fixes and enhancements to postprocessing	10 years ago
orbiter	71758f0d62	enhanced postprocessing by usage of a field-list generation to prevent lazy initialization of the documents. This is useful because the documents must be read completely anyway.	10 years ago
orbiter	7856fbdbe8	fix for npe (in rare cases)	10 years ago
orbiter	8a2b569d7c	fix for literal computation	10 years ago
orbiter	856da2712b	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	ca9cd7b58a	more IPv6 fixes	10 years ago
Michael Peter Christen	b4585e9546	added new index size history image in /Status.html page	10 years ago
Michael Peter Christen	167c5a51f0	IPv6 fix	10 years ago
Michael Peter Christen	fe537679de	fix for exact_signature_unique_b, exact_signature_copycount_i, fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same criteria for 'valid document' as for title and description uniqueness test.	10 years ago
sixcooler	eb9d2705d2	fix for ConnectionInfo.cleanup of server-connections	10 years ago
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	10 years ago
Michael Peter Christen	11074d8d24	fix for an ssl bug that appears only in java 7. The bug was reported in http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5407&p=30956#p30956 a solution was described in http://teknosrc.com/javax-net-ssl-sslprotocolexception-handshake-alert-unrecognized_name-solved/ which worked for this example given in the yacy forum	10 years ago
Michael Peter Christen	e96490e3a1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	77662e08e1	concurrently initialize the error cache; extended also the cache by factor 10 up to 1000 entries. This error cache is only used to catch up paused crawls between shutdown+startup	10 years ago
sixcooler	d8fcc4a2f5	added a timeout on Jetty connectors	10 years ago
Michael Peter Christen	0f0b60404b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
sixcooler	72561926aa	do not overwrite yacy.conf in case of an exception may be a fix for http://mantis.tokeek.de/view.php?id=180	10 years ago
Michael Peter Christen	07c5b57953	removed warnings	10 years ago
orbiter	fa2ad101ec	enhanced graphics computation (avoiding long string parsing for colours)	10 years ago
orbiter	ef813cec91	added proper copyright notice to OSM tiles presented at the search result page	10 years ago
Michael Peter Christen	fca11701f0	better profiling of solr queries	10 years ago
Michael Peter Christen	2e09da9832	npe fix	10 years ago
Michael Peter Christen	d80418f1b1	added partial updates to solr during postprocessing: during postprocessing the solr documents are now not completely retrieved. Instead, only the fields needed for the postprocessing are extracted. When Solr documents are written, this is done using partial updates. This increases postprocessing speed by about 50% for embedded Solr configurations. For external Solr configurations the enhancement should be much higher because the postprocessing with remote Solr is very slow. When doing partial updates to a remote Solr, this method should perform much better than before, it is expected that this is even much higher than the increase with local Solr.	10 years ago
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	10 years ago
Michael Peter Christen	e69883d5ab	fix-fix for `30d4402cd1`	10 years ago
Michael Peter Christen	30d4402cd1	fixed location search	10 years ago
Michael Peter Christen	6983dff334	explain crawl denial when not switched to intranet mode	10 years ago
Michael Peter Christen	f818f84adb	more ipv6 fixes	10 years ago
Michael Peter Christen	afd5bd5f5f	slightly enhanced Network table computation by using a lazy initialized bitfield for peer flags	10 years ago
Michael Peter Christen	2c2b50e65d	refactoring (class name should start with uppercase letter)	10 years ago
Michael Peter Christen	bc275dca07	added network history graph image /NetworkHistory.png which can show many different statistics about the history of the peer.	10 years ago
Marc Nause	ce9368246b	Merge branch 'master' of gitorious.org:yacy/rc1	10 years ago
Marc Nause	5603809deb	Minor changes: *) reduced visibility of a method *) updated comments	10 years ago
Michael Peter Christen	d8beafba3a	fix for values in CrawlProfileEditor table and xml; now the full profile is available in the xml.	10 years ago
Michael Peter Christen	ec95dfa2e6	fixed crawl profile xml result which did not show the correct crawl status.	10 years ago
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	10 years ago
Michael Peter Christen	ee27be3399	misc bugfixes (concurrency, memory protection)	10 years ago
Michael Peter Christen	9b1958e8ca	more ipv6 bugfixes	10 years ago
Michael Peter Christen	7817fc50c9	added a high cpu cycle monitor to PerformanceQueues	10 years ago
Michael Peter Christen	5082feb103	less volume for effect sounds	10 years ago
Michael Peter Christen	e8392e2ff2	fix for local search	10 years ago
Michael Peter Christen	0bfc69b29b	more ipv6 bugfixes	10 years ago
Michael Peter Christen	a27563e5c3	removed the atmo sound clips because they had been too large	10 years ago
Michael Peter Christen	883622306e	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/peers/Protocol.java	10 years ago
Michael Peter Christen	97995a1dd9	fix for remote search process	10 years ago
Michael Peter Christen	0843b12ef3	ipv6 fix: avoid that the shrunk own ip set is overwritten with the (non-valid) set of local IPs	10 years ago
Michael Peter Christen	92c5d97486	fix for bad node flag setting with IPv6	10 years ago
orbiter	c27bad9326	more ipv6 fixes	10 years ago
orbiter	cddf884bc4	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
Michael Peter Christen	460858fb22	more ipv6 fixes	10 years ago
Michael Peter Christen	5cef88a315	argh.. adding missing java class for latest audio feature	10 years ago
Michael Peter Christen	74957f3760	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	2a052f446a	Added an experimental audio feedback system. This is the first element of a new 'decoration' component which may hold switches for different external appearance parameters. The first switch in that context is decoration.audio (as usual in yacy.init). This value is set to false by default, that means the audio feedback element is switched off by default. To switch it on, set decoration.audio = true (using /ConfigProperties_p.html). You will then hear sounds for the following events: - remote searches - incoming dht transmissions - new documents from the crawler Sound clips are stored in htroot/env/soundclips/ which is done so because a future implementation will read these files using the http client and with configurable urls which will make it very easy for the user to replace the given sounds with own sounds.	10 years ago
Marc Nause	1e6e69bc40	Finished implementation of UPNP: *) will try other ports if YaCy standard ports are not available *) distinguish between internal and external port (not sure if this works 100%) Still to add: property in config to enter own external port (in case of manually configured NAT)	10 years ago
Michael Peter Christen	d0358e568b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	e1bc768f9d	more IPv6 bugfixes	10 years ago
reger	59c6532a65	add link extraction to pdfParser: this extracts clickable links in a pdf and adds them to the list of links. Includes a test case for this function. This is the corrected comment for commit: `aa2e15d846`	10 years ago
reger	aa2e15d846	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
orbiter	f3a12801f0	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	d93325a578	lazy handling of process_sxt field (part of postprocessing)	10 years ago
Michael Peter Christen	b31db00010	toString fixes	10 years ago
Michael Peter Christen	961f06c0b6	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	209e0f2fe8	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
reger	b5ca20de15	preserve content_type (mime) if supplied, in preference to constructing it from the file type. (this can eventually benefit image search by using mime only) reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)	10 years ago
reger	fe9f1c594e	fix char encoding parameter in UrlProxy	10 years ago
reger	b0c87d8240	fix image search expand box, cut-off of 2nd capture line height tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height) +fix charset parameter in metadataImageParser +update start errMsgTxt to "java 1.7"	10 years ago
Michael Peter Christen	2c2ed8bf4e	typo in javadoc	10 years ago
Michael Peter Christen	528f583d72	ipv6 fixes	10 years ago
Michael Peter Christen	6ee5b4352d	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	247e626083	IPv6 host parsing bugfixes	10 years ago
reger	fb1fcc2b03	handle noarchive tag, skip writing page to cache http://mantis.tokeek.de/view.php?id=44	10 years ago
Michael Peter Christen	fe917deb2d	when pinging other peers, be able to select the right IP option	10 years ago
Michael Peter Christen	65e6ae52fb	IPv6-enhanced Network monitoring page	10 years ago
Michael Peter Christen	3073c69aee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	6491270b3a	large IPv6 redesign of peer ping methods! removed preferred IPv4 in start options and added a new field IP6 in peer seeds which will contain one or more IPv6 addresses. Now every peer has one or more IP addresses assigned, even several IPv6 addresses are possible. The peer-ping process must check all given and possible IP addresses for a backping and return the one IP which was successful when pinging the peer. The ping-ing peer must be able to recognize which of the given IPs are available for outside access of the peer and store this accordingly. If only one IPv6 address is available and no IPv4, then the IPv6 is stored in the old IP field of the seed DNA. Many methods in Seed.java are now marked as @deprecated because they had been used for a single IP only. There is still a large construction site left in YaCy now where all these deprecated methods must be replaced with new method calls. The 'extra'-IPs, used by cluster assignment had been removed since that can be replaced with IPv6 usage in p2p clusters. All clusters must now use IPv6 if they want an intranet-routing.	10 years ago
reger	eaccce3467	added metadataImageParser for tif and psd (Photoshop) images. This is a modified genericImageParser adding tif (and psd) support even if the java ImageIO plugin for tif is not installed in the JDK. Adds just tif and psd to the available parsers. Uses the same library to extract metadata, so could eventually be merged with genericImageParser. All detected metadata are added to the parsed document (potentially some more than with genericImageParser)	10 years ago
reger	a69f5358ff	use javax ImageIO getReader to add supported image extension/mime. genericImageParser uses javax ImageIO; supported images depend on available plugins for the ImageIO package (this is JDK installation specific). Jpeg, png and gif are available by default. Tif and others only with an available plugin (in classpath). Add supported image types dynamically on startup.	10 years ago
reger	8b1ce49ee6	remove unused variable timeout	10 years ago
reger	48aed15c48	skip loader wait cycle on concurrent access in nocache configuration. In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.	10 years ago
Michael Peter Christen	67cd4c37bd	activated the new apk parser which was already ready but not included in the parser initialization. To make the apk parser usable, the handling of application type links had to be modified. Now all documents which do not have a parser attached are placed in the noload-queue while all other documents are parsed using the associated parser class. This may have side effects on other parsers and the display of different file classes (images, apps, videos).	10 years ago
orbiter	a922b122a3	added a hack to forward solr search results from an external attached solr to the YaCy built-in solr search servlet. It's not complete and not fully correct (there is still a utf8 encoding problem) but it is a way to easily get requests forwarded through YaCy to an external Solr.	10 years ago
Michael Peter Christen	025516f682	fix for crawl limit for number of pages fail	10 years ago
Michael Peter Christen	2645dc816a	added warning for not well-formed postprocessing queries	10 years ago
Michael Peter Christen	437ce3b8a0	added internal api for partial updates to Solr	10 years ago
orbiter	3ac31614a3	added option to reverse-sort YaCy tables (internal API change only)	10 years ago
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: - first retrieve all IDs as given for a query - then retrieve each document individually This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regularly during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	10 years ago
Michael Peter Christen	ad35d9294f	added a 'stats' table which records some peer statistics twice every hour. The table can be shown with http://localhost:8090/Tables_p.html?table=stats The entries have the following meaning: aM: activeLastMonth aW: activeLastWeek aD: activeLastDay aH: activeLastHour cC: countConnected (Active Senior) cD: countDisconnected (Passive Senior) cP: countPotential (Junior) cR: count of the RWI entries cI: size of the index (number of documents) The entry keys are abbreviated to reduce the space in the table as the name is written again for every row. This is the beginning of a 'yacystats' micro-alternative as a built-in function in YaCy. Graphics may follow after some time if enough test data is available.	10 years ago
reger	8284ea751a	catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile found a situation after crash (reboot) with existing running semaphore but YaCy not running. Ping generated exception which finally deleted the conf file (during pre-read procedure) - change to ping (catch exception solved it) - additionally removed delete yacy.conf file (if needed we need to make a backup)	10 years ago
reger	ffa7c7116f	better fix for NPE in image search replace `8931e14514`	10 years ago
Michael Peter Christen	759e7d9538	fix for http://forum.yacy-websuche.de/viewtopic.php?p=30720#p30720	10 years ago
Michael Peter Christen	bf18a39d0e	replaced warning with info	10 years ago
Michael Peter Christen	f1032fb8fe	more enhancements to image search in case that a restriction to a single domain is done	10 years ago
Michael Peter Christen	475125f9d7	hack to get more results when doing a remote site search	10 years ago
Michael Peter Christen	81f9b34da7	increased ability to search for all images on a single server within the p2p remote search	10 years ago
Michael Peter Christen	2c26013c50	better contentdom abstraction	10 years ago
Michael Peter Christen	6a8fb8190b	changed default value for maximum number of connections to 50	10 years ago
Michael Peter Christen	ca8b2bf099	removed www and welcome servlet, these had been demo servlets and are not needed any more	10 years ago
reger	03a7a29db3	limit OAI import urn resolver try for Deutsche National Library The resolver service of National Library uses name space nbn, limit use of nbn-resolving.de accordingly to urn:nbn: - add resolver for rfc's	10 years ago
Michael Peter Christen	0838326a76	changed error message, see http://mantis.tokeek.de/view.php?id=439	10 years ago
reger	b5e0f70197	- remove repositoryPath post from ConfigBasic (obsolete) - remove static snippetComputationTime from ResultEntry (not used)	10 years ago
reger	8931e14514	fix NPE in image search	10 years ago
Michael Peter Christen	1735dbc9d9	enhanced image search: bugfixes and performance enhancements	10 years ago
Michael Peter Christen	ebd0be2cea	fixes and speed updates for search process	10 years ago
Michael Peter Christen	7611bf79bd	Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Conflicts: locales/ru.lng	10 years ago
Michael Peter Christen	524bedc00a	fixed text in startup tray icon and added shutdown icon during shutdown	10 years ago
Michael Peter Christen	4709d8417c	npe fix for non-tray users	10 years ago
orbiter	5b5635e187	replaced font for boot tray icon with image and added some more images for further tray icon displays	10 years ago
orbiter	aa6cdc4ab5	speed-up of start process if remote DNS waits for timeout	10 years ago
orbiter	40b3977c21	added an animation of the tray icon during the boot phase of YaCy. Additionally, there is a tooltip and a new headline at the tray menu which states the current booting status.	10 years ago
Michael Peter Christen	ec6082c872	very bad language detection hack fix hack	10 years ago
Michael Peter Christen	39615de3f9	adding the buffer size is not wrong but may cause confusing information when the buffer is cleaned after a buffer flush which is not then available in Solr since that is waiting for a commit. In such cases the counter would run backwards which is prevented by ignoring the buffer size.	10 years ago
Michael Peter Christen	395edec6f1	changed strategy to count the number of documents: get the max of solr+buffer and the hit cache. This shall help during first crawls to see a running document counter even if there was no commit meanwhile to solr. To support that strategy, the hit cache must be written earlier.	10 years ago
Michael Peter Christen	e87dc08c0d	set the correct fail time in error docs	10 years ago
Michael Peter Christen	cfb20bc0ce	removing the [] for ipv6 addresses may be a bad idea..	10 years ago
orbiter	b6d57f06eb	enhanced the apk parser (up to being production-ready). The parser is not yet activated and will be after the next release step.	10 years ago
Michael Peter Christen	a7dd89c4de	changed method to write the citation index: do not catch up references during document parsing; instead use the same references that would also be written into the webgraph. That should cause that the webgraph and the citation index express the exact same semantic.	10 years ago
Michael Peter Christen	57ce7eeff3	fixed localhost authorization and replaced the adminRealm with an info string which is visible in the browser. That makes it possible that the browser instructs the user how to change a forgotten admin password (during runtime).	10 years ago
orbiter	f318d7c285	enhanced date-ordered ranking	10 years ago
reger	a6891ff7f8	fix Querygoal.parse exception on +/-null-term covers http://mantis.tokeek.de/view.php?id=452	10 years ago
reger	c7335318eb	remove unused legacy procedure from httpserver (deleted generateSocketAddress(port) )	10 years ago
Michael Peter Christen	eab0d3e1a9	bugfix for wrong lock display, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5321&p=30484#p30484	10 years ago
orbiter	49d4f95faf	bugfix to latest commit	10 years ago
orbiter	68211f8244	enable Crawler_p servlet if a rss feed or a wiki dump import was submitted.	10 years ago
orbiter	a65df4ce7e	do not push noindex errors into log if in intranet mode. noindex attributes are attached to artificially constructed index.html files which list directories. Such files are naturally rejected by the crawler and should not appear in the error log because these files are part of the construction of file crawlers and confuse users if they see them in the error log.	10 years ago
orbiter	688c6d8954	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	4ae7aead28	addon to latest fix	10 years ago
Marc Nause	2af56fa37d	Improved UPnP. (still not perfect) *) set HTTPS port if enabled *) improved data structures (may not be final) *) moved UPnP to own package	10 years ago
orbiter	b3ebd38079	removed the HTDOCS repository concept because the concept to host files on the YaCy http server is obsolete; YaCy can index file:// and smb:// paths	10 years ago

7739 Commits (59096935d0f071ded85c907716f22106e31dbe28)