yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	6a1865f507	refactoring date -> lastModified	10 years ago
Michael Peter Christen	ab6cc3c88c	added concurrent generation of snapshot pdfs	10 years ago
Michael Peter Christen	413eeefed4	added character set detection library from http://www-archive.mozilla.org/projects/intl/chardet.html	10 years ago
Michael Peter Christen	7bfc5b80cb	added new options to vocabulary editor: - new switch 'isFacet' which enables or disables the usage of the vocabulary for search facets. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundle terms for such synonyms	10 years ago
Michael Peter Christen	87b53b3572	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	8df8ffbb6d	enhanced the snapshot functionality: - snapshots can now also be xml files which are extracted from the solr index and stored as individual xml files in the snapshot directory along the pdf and jpg images - a transaction layer was placed above the snapshot directory to distinguish snapshots into 'inventory' and 'archive'. This may be used to do transactions of index fragments using archived solr search results between peers. This is currently unfinished, we need a protocol to move snapshots from inventory to archive - the SNAPSHOT directory was renamed to snapshot and contains now two snapshot subdirectories: inventory and archive - snapshots may now be generated by everyone, not only such peers running on a server with wkhtmltopdf installed. The expert crawl start provides the option for snapshots to everyone. PDF snapshots are now optional and the option is only shown if wkhtmltopdf is installed. - the snapshot api now provides the request for historised xml files, i.e. call: http://localhost:8090/api/snapshot.xml?urlhash=Q3dQopFh1hyQ The result of such xml files is identical with solr search results with only one hit. The pdf generation has been moved from the http loading process to the solr document storage process. This may slow down the process a lot and a different version of the process may be needed.	10 years ago
reger	5d67e165d9	remove redundant null check in ResponseHeader.lastModified added a JUnit testcase for ResponseHeader dates (using age()), adjusted age() to pass all tests	10 years ago
reger	5f0bb1214f	modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high.	10 years ago
reger	e52370728a	fix startup stop on missing HTCACHE/SNAPSHOT directory	10 years ago
reger	e5236aa7ca	Merge origin/master	10 years ago
reger	70cf7060a4	coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510	10 years ago
Michael Peter Christen	4fe4bf29ad	added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 The properties depth, order, host and maxcount can be omitted. The meanings of the fields are: host: select only urls from this host or all, if not given depth: select only urls at that crawl depth or all, if not given maxcount: select at most the given number of urls or 10, if not given order: either LATESTFIRST to select the youngest entries, OLDESTFIRST to select the first entries or ANY to select any The rss feed needs administration rights to work, a call to this servlet with rss extension must attach login credentials.	10 years ago
Michael Peter Christen	8b522687e0	added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class	10 years ago
reger	568c991405	remove the unused Request variable (fix of prev. commit)	10 years ago
reger	d6539ba597	Merge origin/master	10 years ago
reger	ff18129def	ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) Request: remove 2 unused init parameter - number of anchors of the parent - forkfactor sum of anchors of all ancestors	10 years ago
Michael Peter Christen	a304058840	added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work...	10 years ago
Michael Peter Christen	d83de9ecf5	added another path for the convert command because on older Macs ImageMagick has a different installation location	10 years ago
Michael Peter Christen	226aea5914	added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, i.e.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ This supports also an on-the-fly generation of the preview documents if the user is an administrator. Otherwise, the servlet fails. To enable this, you must add wkhtmltopdf, imagemagick and (on headless servers) xvfb to your operating system. for detailed instructions, see `97f6089a41`	10 years ago
reger	28456dfc09	skip creation of unused Bluelist contenttransformer	10 years ago
Michael Peter Christen	321840fde3	Replaced all fixed thread pools with cached thread pools. The cached thread pools will flush their cached (dead) threads after 60 seconds. As a result, YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, so the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the cached number of threads. This caused VM supervisors to terminate whole VM sessions if a limit was reached. Many VM providers have limits of numproc=96 which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts.	10 years ago
Michael Peter Christen	7bfab5eb9d	set Busy- and Blocking-Threads to daemon mode (they will now not prevent YaCy from termination if still running)	10 years ago
Michael Peter Christen	e586e423aa	in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile	10 years ago
Michael Peter Christen	d5bac64421	recognize more html file types for snapshots	10 years ago
Michael Peter Christen	a1ee101079	recognize more html file extensions	10 years ago
Michael Peter Christen	8480641f2d	fix to xvfb-run usage (quotes did not parse in xvfb-run, default values are appropriate)	10 years ago
Michael Peter Christen	68b040e31e	added fail-over missing http proxy service (i.e. overload) and quiet mode	10 years ago
Michael Peter Christen	25a64c51b3	moved snapshot generation out of the html handler to prevent that existing cache entries cause that the handler is not executed	10 years ago
Michael Peter Christen	c35170a305	more logging	10 years ago
Michael Peter Christen	e8be07ec78	grr	10 years ago
Michael Peter Christen	6f81bb756c	wrap wkhtmltopdf with xvfb if necessary	10 years ago
Michael Peter Christen	0119f8665d	more logging when failing to create pdf snapshot	10 years ago
Michael Peter Christen	416fe886e3	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	60f27bdf49	added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart)	10 years ago
Michael Peter Christen	97f6089a41	YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must do: Add wkhtmltopdf and imagemagick to your OS, which you can do: On a Mac download wkhtmltox-0.12.1_osx-cocoa-x86-64.pkg from http://wkhtmltopdf.org/downloads.html and download http://cactuslab.com/imagemagick/assets/ImageMagick-6.8.9-9.pkg.zip In Debian do "apt-get install wkhtmltopdf imagemagick" Then check in /Settings_p.html?page=ProxyAccess: "Transparent Proxy" and "Always Fresh" - this is used by wkhtmltopdf to fetch web pages using the YaCy proxy. Using "Always Fresh" it is possible to get all pages from the proxy cache. Finally, you will see a new option when starting an expert web crawl. You can set a maximum depth for crawling which should cause a pdf generation. The resulting pdfs are then available in DATA/HTCACHE/SNAPSHOTS/<host>.<port>/<depth>/<shard>/<urlhash>.<date>.pdf	10 years ago
reger	ff80700aff	replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal	10 years ago
Michael Peter Christen	9ea120dbe5	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	0c97cc2440	skip unused call parameter for hashSentence()	10 years ago
reger	5790c7242e	skip to tokenize punctuation as word in WordTokenizer remove unused variables in condenser related to Tokenizer	10 years ago
reger	f07392ff17	add. use host port parameter in YaCyApp	10 years ago
Michael Peter Christen	09d2867050	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	ad0da5f246	added new web page snapshot infrastructure which will lead to the ability to have web page previews in the search results. (This is a stub, no function available with this yet...)	10 years ago
Michael Peter Christen	5f5c7d69d1	added image screenshot generator	10 years ago
Michael Peter Christen	1d45d9405a	security bugfix	10 years ago
Michael Peter Christen	ff728b4aa5	ignore url errors during search	10 years ago
Michael Peter Christen	8317914ce3	changed vocabulary navigator object type to TreeMap to get a specific order into the vocabularies. This is now lexicographic, which is less random than a hashed order	10 years ago
Michael Peter Christen	d5c1b07768	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	c0f9f6ac66	added option to change the navbar-default, i.e. usable for dark skins	10 years ago
Michael Peter Christen	10794e8efd	trying facet.method fc instead of fcs to handle large facets	10 years ago
Michael Peter Christen	041b605cfe	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
Michael Peter Christen	f1f74e8626	toString fix	10 years ago
Michael Peter Christen	30276a2b48	prevent that a local Solr search and a local RWI search are running concurrently. When a RWI search result is flushed into the result set, it does Solr Queries (which replaced the old-style Metadata Queries) and they are possibly running concurrently to a previously started Solr search. Both methods may block each other with IO. To enhance the speed, they are now serialized. Because the Solr search results may result in better results using the more advanced and configurable Ranking methods, this result is preferred over the RWI search result. However, remote RWI search results are still fed concurrently into the search result as well.	10 years ago
Michael Peter Christen	84763126e0	added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh' the cache is always used if the entry in the cache exists. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false by default, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet!	10 years ago
reger	1e7ee72240	fix path lookup to ./defaults/yacy.badwords (fix of commit `ee277b9b3e`)	10 years ago
reger	7d863d6254	fix empty text facet entry (noticed on Author facet)	10 years ago
Michael Peter Christen	a39419f2ef	more stacks shall be considered for on-demand loading, not only deep-depth stacks to prevent "too many open files" problem	10 years ago
Michael Peter Christen	5bb52f79be	reduce number of calls to queue.size() because that may be a bottleneck during crawling	10 years ago
Michael Peter Christen	4920ab7b76	optimize usage of size() cache	10 years ago
reger	ee277b9b3e	allow for local yacy.stopwords and yacy.badwords list (in DATA/SETTINGS/) if file in DATA/SETTINGS it is loaded otherwise file in ./defaults is loaded (if locale ./defaults/stopwords.xx doesn't exist take solr/lang/stopwords_xx.txt as default) move yacy.stopwords, yacy.stopwords.de and yacy.badwords.example out of root directory to ./defaults directory	10 years ago
reger	de56266bcb	remove redundant toLower for topwords	10 years ago
Michael Peter Christen	a34f837592	better delete all files in path when removing host crawl stack	10 years ago
Michael Peter Christen	10b1db430a	if we have many hosts, use on-demand earlier	10 years ago
Michael Peter Christen	1324927e66	prevent division by zero	10 years ago
Michael Peter Christen	2beb6abeb6	disabled crazy sleep loop	10 years ago
Michael Peter Christen	70f03f7c8e	do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails.	10 years ago
Michael Peter Christen	a0b84e4def	use a LinkedHashMap for facets to maintain facet order as given by solr	10 years ago
reger	ef5dc68313	include domtype to searcheventcache id to differentiate between local / global events for reuse of cached events fix for http://mantis.tokeek.de/view.php?id=493	10 years ago
Michael Peter Christen	0dc6e0a5f2	added option to enrich vocabularies with synonyms from synonym database	10 years ago
Michael Peter Christen	6a2a669db4	added loading of the synonyms file from addon/synonyms into the knowledge loader	10 years ago
Michael Peter Christen	c67c5c0709	added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, i.e. if a document contains words from a specific vocabulary, boost it.	10 years ago
Michael Peter Christen	a67a465415	fix field counter for multi-fields in html writer for the solr servlet	10 years ago
Michael Peter Christen	ec9d021568	added option in vocabulary editor to import CSV files with different encodings (preselected windows-type character encoding which is typical for CSV files). Fixed also other problems with character encoding in dictionary files. Automatically generated vocabularies are now also noted in the API steering.	10 years ago
reger	3c818fc912	add a check of java version string >=1.7 to startup class stopping start with error msg on version < 1.7	10 years ago
Michael Peter Christen	0550b54d56	added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects.	10 years ago
Michael Peter Christen	68e8039fd1	added high-precision scheduler for API processes. This allows also to make the execution in dependency of available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute.	10 years ago
Michael Peter Christen	8aee7f940e	added missing class for latest changes	10 years ago
Michael Peter Christen	97039049e4	fix in key enumeration methods for cases where the enumeration is done in reverse order.	10 years ago
Michael Peter Christen	7e1b0b6712	fix for wildcard patch in search queries	10 years ago
Michael Peter Christen	0a879c98e7	added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when a url was first seen. This is then used to overwrite the modification date for urls upon recrawl in case that the first-seen date is before the latest document date. This behaviour is necessary due to the common behaviour of content management systems which attach always the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date in case that the crawler starts frequently for the same domain. As a result the search results ordered by date have a much better quality and the usage of YaCy as search agent for latest news has a better quality.	10 years ago
Michael Peter Christen	421ee64f33	another fix to ordering of table indexes; fixes also network stats graphics	10 years ago
Michael Peter Christen	1db476c67e	fix for bad table iteration	10 years ago
reger	e4316e2d74	skip creation of local var in proxyhandler.storetocache	10 years ago
sixcooler	9c6e3a6b1c	fix assertion failure in version-string for Solr-4.10.2 by changing the assert - hope that is ok + add forgotten NB-Project changes	10 years ago
sixcooler	725b206fb4	update to solr-/lucene-4.10.2	10 years ago
Michael Peter Christen	5c97ecb30f	fix of bad query generation for search facets	10 years ago
Michael Peter Christen	95d87f00b3	fix for bad query generation in doublecheck in postprocessing	10 years ago
orbiter	72c2bc5189	fix for search in case where local peer has no local seed address in portal mode	10 years ago
orbiter	5be352da99	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	0fcd8097a3	removed unused options from BusyThreads	10 years ago
Michael Peter Christen	fe8b1d137d	emergency bugfix for 100% CPU in image drawing	10 years ago
Michael Peter Christen	92007e5d2d	more enhancements to postprocessing speed	10 years ago
Michael Peter Christen	9a7fe9e0d1	fix for bad timing computation in postprocessing	10 years ago
Michael Peter Christen	bd16119a00	another fix for postprocessing (the query for "" on numeric field did not work in external solr)	10 years ago
Michael Peter Christen	327e83bfe7	more fixes in postprocessing: partitioning of the complete queue to enable smaller queries	10 years ago
orbiter	2bc6199408	more concurrency for postprocessing	10 years ago
orbiter	a83cf26c38	more fixes and enhancements to postprocessing	10 years ago
orbiter	71758f0d62	enhanced postprocessing by usage of a field-list generation to prevent lazy initialization of the documents. This is useful because the documents must be read completely anyway.	10 years ago
orbiter	7856fbdbe8	fix for npe (in rare cases)	10 years ago
orbiter	8a2b569d7c	fix for literal computation	10 years ago
orbiter	856da2712b	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	ca9cd7b58a	more IPv6 fixes	10 years ago
Michael Peter Christen	b4585e9546	added new index size history image in /Status.html page	10 years ago
Michael Peter Christen	167c5a51f0	IPv6 fix	10 years ago
Michael Peter Christen	fe537679de	fix for exact_signature_unique_b, exact_signature_copycount_i, fuzzy_signature_unique_b and fuzzy_signature_copycount_i: apply same criteria for 'valid document' as for title and description uniqueness test.	10 years ago
sixcooler	eb9d2705d2	fix for ConnectionInfo.cleanup of server-connections	10 years ago
Michael Peter Christen	2e5214eb21	added field postprocessing.partialUpdate to settings which can be used to switch on or off partial updates. Both options should cause the same result. Default is on.	10 years ago
Michael Peter Christen	11074d8d24	fix for an ssl bug that appears only in java 7. The bug was reported in http://forum.yacy-websuche.de/viewtopic.php?f=23&t=5407&p=30956#p30956 a solution was described in http://teknosrc.com/javax-net-ssl-sslprotocolexception-handshake-alert-unrecognized_name-solved/ which worked for this example given in the yacy forum	10 years ago
Michael Peter Christen	e96490e3a1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	77662e08e1	concurrently initialize the error cache; extended also the cache by factor 10 up to 1000 entries. This error cache is only used to catch up paused crawls between shutdown+startup	10 years ago
sixcooler	d8fcc4a2f5	added a timeout on Jetty connectors	10 years ago
Michael Peter Christen	0f0b60404b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
sixcooler	72561926aa	do not overwrite yacy.conf in case of an exception may be a fix for http://mantis.tokeek.de/view.php?id=180	10 years ago
Michael Peter Christen	07c5b57953	removed warnings	10 years ago
orbiter	fa2ad101ec	enhanced graphics computation (avoiding long string parsing for colours)	10 years ago
orbiter	ef813cec91	added proper copyright notice to OSM tiles presented at the search result page	10 years ago
Michael Peter Christen	fca11701f0	better profiling of solr queries	10 years ago
Michael Peter Christen	2e09da9832	npe fix	10 years ago
Michael Peter Christen	d80418f1b1	added partial updates to solr during postprocessing: during postprocessing the solr documents are now not completely retrieved. instead, only fields needed for the postprocessing are extracted. When Solr documents are written, this is done using partial updates. This increases postprocessing speed by about 50% for embedded Solr configurations. For external Solr configurations the enhancement should be much higher because the postprocessing with remote Solr is very slow. When doing partial updates to a remote Solr, this method should perform much better than before, it is expected that this is even much higher than the increase with local Solr.	10 years ago
Michael Peter Christen	b1cfbc4a04	added new solr field url_paths_count_i which can be used to enhance the index browser and maybe also for ranking; possibly also for SEO-with-YaCy applications.	10 years ago
Michael Peter Christen	e69883d5ab	fix-fix for `30d4402cd1`	10 years ago
Michael Peter Christen	30d4402cd1	fixed location search	10 years ago
Michael Peter Christen	6983dff334	explain crawl denial when not switched to intranet mode	10 years ago
Michael Peter Christen	f818f84adb	more ipv6 fixes	10 years ago
Michael Peter Christen	afd5bd5f5f	slightly enhanced Network table computation by using a lazy initialized bitfield for peer flags	10 years ago
Michael Peter Christen	2c2b50e65d	refactoring (class name should start with uppercase letter)	10 years ago
Michael Peter Christen	bc275dca07	added network history graph image /NetworkHistory.png which can show many different statistics about the history of the peer.	10 years ago
Marc Nause	ce9368246b	Merge branch 'master' of gitorious.org:yacy/rc1	10 years ago
Marc Nause	5603809deb	Minor changes: ) reduced visibility of a method ) updated comments	10 years ago
Michael Peter Christen	d8beafba3a	fix for values in CrawlProfileEditor table and xml; now the full profile is available in the xml.	10 years ago
Michael Peter Christen	ec95dfa2e6	fixed crawl profile xml result which did not show the correct crawl status.	10 years ago
Michael Peter Christen	8c1a89cb34	added another decoration flag to switch off network graphics in crawler monitor and index browser: decoration.grafics.linkstructure Please set this to false to remove the graphics from the interface.	10 years ago
Michael Peter Christen	ee27be3399	misc bugfixes (concurrency, memory protection)	10 years ago
Michael Peter Christen	9b1958e8ca	more ipv6 bugfixes	10 years ago
Michael Peter Christen	7817fc50c9	added a high cpu cycle monitor to PerformanceQueues	10 years ago
Michael Peter Christen	5082feb103	less volume for effect sounds	10 years ago
Michael Peter Christen	e8392e2ff2	fix for local search	10 years ago
Michael Peter Christen	0bfc69b29b	more ipv6 bugfixes	10 years ago
Michael Peter Christen	a27563e5c3	removed the atmo sound clips because they had been too large	10 years ago
Michael Peter Christen	883622306e	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/peers/Protocol.java	10 years ago
Michael Peter Christen	97995a1dd9	fix for remote search process	10 years ago
Michael Peter Christen	0843b12ef3	ipv6 fix: avoid that shrunk own ip set is overwritten with (non-valid) set of local IPs	10 years ago
Michael Peter Christen	92c5d97486	fix for bad node flag setting with IPv6	10 years ago
orbiter	c27bad9326	more ipv6 fixes	10 years ago
orbiter	cddf884bc4	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
Michael Peter Christen	460858fb22	more ipv6 fixes	10 years ago
Michael Peter Christen	5cef88a315	argh.. adding missing java class for latest audio feature	10 years ago
Michael Peter Christen	74957f3760	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	2a052f446a	Added an experimental audio feedback system. This is the first element of a new 'decoration' component which may hold switches for different external appearance parameters. The first switch in that context is decoration.audio (as usual in yacy.init). This value is set to false by default, that means the audio feedback element is switched off by default. To switch it on, set decoration.audio = true (using /ConfigProperties_p.html). You will then hear sounds for the following events: - remote searches - incoming dht transmissions - new documents from the crawler Sound clips are stored in htroot/env/soundclips/ which is done so because a future implementation will read these files using the http client and with configurable urls which will make it very easy for the user to replace the given sounds with own sounds.	10 years ago
Marc Nause	1e6e69bc40	Finished implementation of UPNP: ) will try other ports if YaCy standard ports are not available ) distinguish between internal and external port (not sure if this works 100%) Still to add: property in config to enter own external port (in case of manually configured NAT)	10 years ago
Michael Peter Christen	d0358e568b	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	e1bc768f9d	more IPv6 bugfixes	10 years ago
reger	59c6532a65	add link extraction to pdfParser this extracts clickable links in pdf and adds it to the list of links include a test case for this function this is the corrected comment for commit: `aa2e15d846`	10 years ago
reger	aa2e15d846	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
orbiter	f3a12801f0	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	10 years ago
orbiter	d93325a578	lazy handling of process_sxt field (part of postprocessing)	10 years ago
Michael Peter Christen	b31db00010	toString fixes	10 years ago
Michael Peter Christen	961f06c0b6	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
reger	209e0f2fe8	allow url parameter in worktable apicall allow url=wwwl?param=a&param=b (with ?, & encoded) fix: http://mantis.tokeek.de/view.php?id=100 fix double adding of '&' in MultiProtocolURL.escape()	10 years ago
reger	b5ca20de15	preserve content_type (mime) if supplied in preference of construct in from file type. (this eventually can benefit image search by using mime only) reduce redundant field assignment for Solrdocuments created from URIMetadataNode (URIMetadataNode = SolrDocument with partially assigned fields)	10 years ago
reger	fe9f1c594e	fix char encoding parameter in UrlProxy	10 years ago
reger	b0c87d8240	fix image search expand box, cut-off of 2nd capture line height tested with IE11 and Firefox 32 (change worked for both to show 2nd line without cutting off height) +fix charset parameter in metadataImageParser +update start errMsgTxt to "java 1.7"	10 years ago
Michael Peter Christen	2c2ed8bf4e	typo in javadoc	10 years ago
Michael Peter Christen	528f583d72	ipv6 fixes	10 years ago
Michael Peter Christen	6ee5b4352d	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	247e626083	IPv6 host parsing bugfixes	10 years ago
reger	fb1fcc2b03	handle noarchive tag, skip writing page to cache http://mantis.tokeek.de/view.php?id=44	10 years ago
Michael Peter Christen	fe917deb2d	when pinging other peers, be able to select the right IP option	10 years ago
Michael Peter Christen	65e6ae52fb	IPv6-enhanced Network monitoring page	10 years ago
Michael Peter Christen	3073c69aee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	10 years ago
Michael Peter Christen	6491270b3a	large IPv6 redesign of peer ping methods! removed preferred IPv4 in start options and added a new field IP6 in peer seeds which will contain one or more IPv6 addresses. Now every peer has one or more IP addresses assigned, even several IPv6 addresses are possible. The peer-ping process must check all given and possible IP addresses for a backping and return the one IP which was successful when pinging the peer. The ping-ing peer must be able to recognize which of the given IPs are available for outside access of the peer and store this accordingly. If only one IPv6 address is available and no IPv4, then the IPv6 is stored in the old IP field of the seed DNA. Many methods in Seed.java are now marked as @deprecated because they had been used for a single IP only. There is still a large construction site left in YaCy now where all these deprecated methods must be replaced with new method calls. The 'extra'-IPs, used by cluster assignment had been removed since that can be replaced with IPv6 usage in p2p clusters. All clusters must now use IPv6 if they want an intranet-routing.	10 years ago
reger	eaccce3467	added metadataImageParser for tif and psd (Photoshop) images. This is a modified genericImageParser adding tif (and psd) support even if java ImageIO plugin for tif is not installed in JDK. Adds just tif and psd to the available parsers. Uses the same library to extract metadata, so could eventually be merged with genericImageParser. All detected metadata are added to the parsed document (potentially some more as with genericImageParser)	10 years ago
reger	a69f5358ff	use javax ImageIO getReader to add supported image extension/mime genericImageParser uses javax ImageIO, supported images depend on available plugins for ImageIO package (this is JDK installation specific). Jpeg, png and gif are available by default. Tif and others only on available plugin (in classpath). Add supported image type dynamically on startup.	10 years ago
reger	8b1ce49ee6	remove unused variable timeout	10 years ago
reger	48aed15c48	skip loader wait cycle on concurrent access in nocache configuration. In nocache config resource is loaded online, leaving no benefit to wait for a faster cache hit.	10 years ago
Michael Peter Christen	67cd4c37bd	activated the new apk parser which was already ready but not included in the parser initialization. To make the apk parser usable, the handling of application type links had to be modified. Now all documents which have not a parser attached are placed to the noload-queue while all other documents are parsed using the associated parser class. This may have side-Effects on other parsers and the display of different file classes (images, apps, videos).	11 years ago
orbiter	a922b122a3	added a hack to forward solr search results from an external attached solr to the YaCy built-in solr search servlet. It's not complete and not fully correct (there is still a utf8 encoding problem) but it is a way to get easily requests forwarded through YaCy to an external Solr.	11 years ago
Michael Peter Christen	025516f682	fix for crawl limit for number of pages fail	11 years ago
Michael Peter Christen	2645dc816a	added warning for not well-formed postprocessing queries	11 years ago
Michael Peter Christen	437ce3b8a0	added internal api for partial updates to Solr	11 years ago
orbiter	3ac31614a3	added option to reverse-sort YaCy tables (internal API change only)	11 years ago
Michael Peter Christen	6d3d4c4ea6	changed the concurrent enumeration of query results in such a way that it is now possible to get the results in two steps: - first retrieve all IDs as given for a query - then retrieve each document individually. This was necessary for very large result sets where a query may run for hours and is possibly terminated by a solr-internal timeout. This occurs regularly during postprocessing and therefore this commit may fix unwanted postprocessing terminations.	11 years ago
Michael Peter Christen	ad35d9294f	added a 'stats' table which records some peer statistics twice every hour. The table can be shown with http://localhost:8090/Tables_p.html?table=stats The entries have the following meaning: aM: activeLastMonth aW: activeLastWeek aD: activeLastDay aH: activeLastHour cC: countConnected (Active Senior) cD: countDisconnected (Passive Senior) cP: countPotential (Junior) cR: count of the RWI entries cI: size of the index (number of documents) The entry keys are abbreviated to reduce the space in the table as the name is written again for every row. This is the beginning of a 'yacystats' micro-alternative as a built-in function in YaCy. Graphics may follow after some time if enough test data is available.	11 years ago
reger	8284ea751a	catch TimeoutException during ping and do not delete yacy.conf during prereadconfigfile found a situation after crash (reboot) with existing running semaphore but YaCy not running. Ping generated exception which finally deleted the conf file (during pre-read procedure) - change to ping (catch exception solved it) - additionally removed delete yacy.conf file (if needed we need to make a backup)	11 years ago
reger	ffa7c7116f	better fix for NPE in image search replace `8931e14514`	11 years ago
Michael Peter Christen	759e7d9538	fix for http://forum.yacy-websuche.de/viewtopic.php?p=30720#p30720	11 years ago
Michael Peter Christen	bf18a39d0e	replaced warning with info	11 years ago
Michael Peter Christen	f1032fb8fe	more enhancements to image search in case that a restriction to a single domain is done	11 years ago
Michael Peter Christen	475125f9d7	hack to get more results when doing a remote site search	11 years ago
Michael Peter Christen	81f9b34da7	increased ability to search for all images on a single server within the p2p remote search	11 years ago
Michael Peter Christen	2c26013c50	better contentdom abstraction	11 years ago
Michael Peter Christen	6a8fb8190b	changed default value for maximum number of connections to 50	11 years ago
Michael Peter Christen	ca8b2bf099	removed www and welcome servlet, these had been demo servlets and are not needed any more	11 years ago
reger	03a7a29db3	limit OAI import urn resolver try for Deutsche National Library The resolver service of National Library uses name space nbn, limit use of nbn-resolving.de accordingly to urn:nbn: - add resolver for rfc's	11 years ago
Michael Peter Christen	0838326a76	changed error message, see http://mantis.tokeek.de/view.php?id=439	11 years ago
reger	b5e0f70197	- remove repositoryPath post from ConfigBasic (obsolete) - remove static snippetComputationTime from ResultEntry (not used)	11 years ago
reger	8931e14514	fix NPE in image search	11 years ago
Michael Peter Christen	1735dbc9d9	enhanced image search: bugfixes and performance enhancements	11 years ago
Michael Peter Christen	ebd0be2cea	fixes and speed updates for search process	11 years ago
Michael Peter Christen	7611bf79bd	Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Conflicts: locales/ru.lng	11 years ago
Michael Peter Christen	524bedc00a	fixed text in startup tray icon and added shutdown icon during shutdown	11 years ago
Michael Peter Christen	4709d8417c	npe fix for non-tray users	11 years ago
orbiter	5b5635e187	replaced font for boot tray icon with image and added some more images for further tray icon displays	11 years ago
orbiter	aa6cdc4ab5	speed-up of start process if remote DNS waits for timeout	11 years ago
orbiter	40b3977c21	added an animation of the tray icon during the boot phase of YaCy. Additionally, there is a tooltip and a new headline at the tray menu which states the current booting status.	11 years ago
Michael Peter Christen	ec6082c872	very bad language detection hack fix hack	11 years ago
Michael Peter Christen	39615de3f9	adding the buffer size is not wrong but may cause confusing information when the buffer is cleaned after a buffer flush which is not then available in Solr since that is waiting for a commit. In such cases the counter would run backwards which is prevented by ignoring the buffer size.	11 years ago
Michael Peter Christen	395edec6f1	changed strategy to count the number of documents: get the max of solr+buffer and the hit cache. This shall help during first crawls to see a running document counter even if there was no commit meanwhile to solr. To support that strategy, the hit cache must be written earlier.	11 years ago
Michael Peter Christen	e87dc08c0d	set the correct fail time in error docs	11 years ago
Michael Peter Christen	cfb20bc0ce	removing the [] for ipv6 addresses may be a bad idea..	11 years ago
orbiter	b6d57f06eb	enhanced the apk parser (up to being production-ready). The parser is not yet activated and will be after the next release step.	11 years ago
Michael Peter Christen	a7dd89c4de	changed method to write the citation index: do not catch up references during document parsing; instead use the same references that would also be written into the webgraph. That should cause that the webgraph and the citation index express the exact same semantic.	11 years ago
Michael Peter Christen	57ce7eeff3	fixed localhost authorization and replaced the adminRealm with an info string which is visible in the browser. That makes it possible that the browser instructs the user how to change a forgotten admin password (during runtime).	11 years ago
orbiter	f318d7c285	enhanced date-ordered ranking	11 years ago
reger	a6891ff7f8	fix Querygoal.parse exception on +/-null-term covers http://mantis.tokeek.de/view.php?id=452	11 years ago
reger	c7335318eb	remove unused legacy procedure from httpserver (deleted generateSocketAddress(port) )	11 years ago
Michael Peter Christen	eab0d3e1a9	bugfix for wrong lock display, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5321&p=30484#p30484	11 years ago
orbiter	49d4f95faf	bugfix to latest commit	11 years ago
orbiter	68211f8244	enable Crawler_p servlet if a rss feed or a wiki dump import was submitted.	11 years ago
orbiter	a65df4ce7e	do not push noindex errors into log if in intranet mode. noindex attributes are attached to artificially constructed index.html files which list directories. Such files are naturally rejected by the crawler and should not appear in the error log because these files are part of the construction of file crawlers and confuse users if they see them in the error log.	11 years ago
orbiter	688c6d8954	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4ae7aead28	addon to latest fix	11 years ago
Marc Nause	2af56fa37d	Improved UPnP. (still not perfect) *) set HTTPS port if enabled *) improved data structures (may not be final) *) moved UPnP to own package	11 years ago
orbiter	b3ebd38079	removed the HTDOCS repository concept because the concept to host files on the YaCy http server is obsolete; YaCy can index file:// and smb:// paths	11 years ago
reger	1fdcc2d67b	change seedfile upload ip check to allow intranet ip in intranet mode - this allows to setup a principal peer in intranet environment	11 years ago
reger	e31b0e6d67	- update javadoc Seed.getIP - default mySeed.ip to hostip in SeedDB.initMySeed() if Intranetmode; this allows a peer to reach senior status in an intranet-hosted search network with few peers, otherwise the peer would stay junior because of default init with loopback ip as public (dna) ip.	11 years ago
reger	350c6b8250	in IntranetMode allow intranet hosted seedlist with Network_Domain "any" - so far intranet seedlist hosts are always denied but need to be allowed in intranet mode	11 years ago
orbiter	d68438c3d9	make sure that the postprocessing background thread never dies by any exception	11 years ago
orbiter	b4f2a1db6e	added an unlock icon for all protected pages that are unlocked because the administrator is logged in.	11 years ago
reger	ea6c9e9b07	reduce mem buffer overhead for gap files during r/w (they are typically small compared to idx allowing to use smaller buffersize -> set to 16k records)	11 years ago
reger	e88537522d	allow single quote " ' " in query see http://mantis.tokeek.de/view.php?id=379 -add QueryGoal test case for this	11 years ago
orbiter	487021fb0a	snippet computation update	11 years ago
orbiter	1c2f1f233a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	5a4995ded3	fill solr rss writer dc:subject tag with keyword content	11 years ago
orbiter	927aaa95a6	concurrency bugfix	11 years ago
orbiter	c9e593cf78	removed warnings	11 years ago
reger	7584352e7b	use more predefined Solr query parameter constants - use CommonParams and DisMaxParams constants - fix typo in get sort parameter - getDocumentCountByParams redundant implementation and risk of not optimized call (row parameter unspecified) -> as only used from getCountByQuery removed from interface	11 years ago
reger	f9db5dd6c5	reduce doublecontent check document (prevent out of memory) see http://mantis.tokeek.de/view.php?id=437 test result (concurrency=7) 2000 docs = eom always 1000 docs = eom always 100 docs = eom never chosen -> 200 docs (eom not encountered during test with 1GB mem setting)	11 years ago
reger	e9eae45b55	simplify rssreader and improve atom feed link extraction - type detection (rss/atom) - init type parameter overwritten during parse, parameter obsolete - detection by endtag changed to simpler first-tag evaluation - channel image not used, removed related extra parser handling - remove unused code (set/getImage) in rssfeed - atom link extraction to account for possible multiple link tags - spec limits link to one with rel="alternate" or one without rel attribute not accounting for the following type & hreflang exception yet: o atom:entry elements MUST NOT contain more than one atom:link element with a rel attribute value of "alternate" that has the same combination of type and hreflang attribute values.	11 years ago
reger	a8508417d1	catch NPE during crawl (OAI import) - condenseDocument mime=null (allowed) - collectionconfiguration responseheader = null (allowed)	11 years ago
reger	3dde94422f	center searchevent lines on network graph (PerformanceSearch_p.html)	11 years ago
Michael Peter Christen	3860711aef	fix for possible interruption of concurrent queries	11 years ago
Michael Peter Christen	6344718f8b	reducing the concurrent query stack size and reduced concurrency of postprocessing to avoid OOM situations	11 years ago
Michael Peter Christen	eca9380e3d	bugfix for crawler double-check: if an url is redirected, the redirect-target was not double-checked. This is now done by replacing the redirect-URL on the crawl queue again (where it is double-checked)	11 years ago
Michael Peter Christen	9ac0c93f17	fix for subpath crawl filter	11 years ago
Michael Peter Christen	66106bdaf0	fix for crawler attribute maxdompages	11 years ago
Michael Peter Christen	49d91b94c3	npe fix in crawler	11 years ago
Michael Peter Christen	b7183a7321	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	ea2e627662	fix ConfigAccounts del user with uppercase letter in name (usernames are case sensitive, userdb.delete used toLower)	11 years ago
Michael Peter Christen	c465b791af	typo	11 years ago
Michael Peter Christen	191ec8c82a	added concurrency to postprocess rewrite process	11 years ago
Michael Peter Christen	a1e8bdd5e9	log ppm instead of docs/second	11 years ago
Michael Peter Christen	cc0ded7abd	set process type of web graph according to fields as defined in the schema	11 years ago
Michael Peter Christen	12fb9d7cd1	log postprocessing constraints in case that postprocessing is not performed	11 years ago
Michael Peter Christen	3c23b89823	less logging	11 years ago
Michael Peter Christen	a0c53174c5	better solr query logging to detect unnecessary sort requests for more performance profiling	11 years ago
Michael Peter Christen	338f574bdc	no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring	11 years ago
Michael Peter Christen	1609763be5	toString fix	11 years ago
Michael Peter Christen	b983e68254	more retries, less sleep	11 years ago
Michael Peter Christen	1503ba7794	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	8f77719091	fix "Ljava.lang.String" in crawl queue anchor name (e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue)	11 years ago
Michael Peter Christen	0ceeceb35e	more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document	11 years ago
orbiter	38864ae004	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4099296b45	added new classes which shall reduce call overhead to Solr (stub)	11 years ago
reger	d0c02e1de7	adjust rss lat/lon to double (common format across other classes)	11 years ago
orbiter	3491ab4c38	removed unused images from webgraph edge computation	11 years ago
orbiter	2371d6b8db	target linktexts must be string to enable search facets on these fields	11 years ago
Michael Peter Christen	001e05bb80	do not store failure of loading of robots.txt into the index as a fail document	11 years ago
Michael Peter Christen	05d58e4df0	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	98f45c9032	fix for image alt attachment to AnchorURLs in html parser.	11 years ago
orbiter	22ce4fb4dd	better error handling for remote solr queries and exists-checks	11 years ago
Marc Nause	9df14fc126	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Marc Nause	477be17c51	Replaced old UPNP library with Weupnp. UPNP should work now, at least it does on my network. UPNP code in YaCy can still be improved though (see TODO comment: make port on gateway configurable or find free one). *) removed old code *) added new lib *) changed code to work with new lib	11 years ago
orbiter	738989aab7	reverted commit `f94c91315b` because the webgraph has not enough performance for that	11 years ago
orbiter	e9163e7e10	fix for malformed hostpath names in crawl balancer	11 years ago
Michael Peter Christen	c115f3869c	enhanced snippet computation and test method in ViewFile	11 years ago
reger	6c10b59f3e	move bootstrap peers test systems to its test class var assignment not needed elsewhere.	11 years ago
orbiter	1027f3d04a	fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation.	11 years ago
Michael Peter Christen	f94c91315b	if the webgraph is used, then use it also for reference computation to avoid contradictions with references_i in the collection index.	11 years ago
Michael Peter Christen	6e1dc444c3	added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result.	11 years ago
orbiter	4b06adb751	fix for file urls	11 years ago
orbiter	08409ec680	no idea why the words max was an ordered one. This change increases speed during document processing a bit	11 years ago
reger	e5854a5cdb	fix localhost link to opensearchdescription.xml	11 years ago
Michael Peter Christen	b44626e55b	fixed target_alt_t in webgraph	11 years ago
Michael Peter Christen	504327b15c	fix for condition for writing the webgraph	11 years ago
Michael Peter Christen	542c20a597	changed handling of crawl profile field crawlingIfOlder: this should be filled with the date, when the url is recognized as to be outdated. That field was partly misinterpreted and the time interval was filled in. In case that all the urls which are in the index shall be treated as outdated, the field is filled now with Long.MAX_VALUE because then all crawl dates are before that date and therefore outdated.	11 years ago
Michael Peter Christen	4eec1a7452	refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata)	11 years ago
reger	c95ba52cf0	improve logexception info - log a message or class name instead of msgtxt "null"	11 years ago
orbiter	e441831a24	reverted toString() change in AnchorURL to prevent mistakenly used toString(). This fixes also the update link bug.	11 years ago
reger	47f201a6b8	Add Solr default query fields (&qf) to select servlet according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query). This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration. This makes sure all relevant fields (as determined by the index owner) are searched, while still maintaining the option to query specific fields, and does not rely on the duplication of text to text_t. - add author to reset-default boost fields (support results for author nav)	11 years ago
reger	f96cfdc84d	prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter	11 years ago
reger	5f5fb4ecdc	remove unused static (RSS)search from protocol	11 years ago
reger	7c1706d83a	use CRLF in generated bat command scripts for windows - for easier viewing with standard viewers	11 years ago
reger	a2cb366b25	Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems	11 years ago
Michael Peter Christen	2de159719b	added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead.	11 years ago
Michael Peter Christen	bf1b6b93e7	do not write CR values to webgraph if no CR values are computed	11 years ago
Michael Peter Christen	e039e78210	small bugfixes	11 years ago
Michael Peter Christen	32a2ff925c	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	d07cdd8c3b	added SolrCloud access mode and configuration	11 years ago
Michael Peter Christen	8514bffc22	enhanced postprocessing status report	11 years ago
reger	b24572f304	fix GSA filter query assignment - use more parameter constants	11 years ago
Michael Peter Christen	b5fc2b63ea	removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available.	11 years ago
Michael Peter Christen	62c72360ee	cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr	11 years ago
Michael Peter Christen	dd5cdfe212	reverted filter query hack, it did not work	11 years ago
Michael Peter Christen	b5d78ba156	reduced number of solr queries during crawling	11 years ago
Michael Peter Christen	5326970d6c	enhanced solr queries for single document extraction	11 years ago
Michael Peter Christen	525575bd97	added debugging of filter queries in thread dump thread names	11 years ago
Michael Peter Christen	f319ef268f	testing filter queries instead of queries to retrieve documents by id	11 years ago
Michael Peter Christen	fd87fa1613	removed more unnecessary exist-checks in ErrorCache	11 years ago
Michael Peter Christen	f2b476e08b	don't do a double check to solr for failed documents if they are not written to solr	11 years ago
Michael Peter Christen	06ab72d1af	enhanced crawler host round-robin strategy	11 years ago
orbiter	dab9a0786a	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	51bf5c85b0	Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all.	11 years ago
Michael Peter Christen	a694b6a8fc	another fix for unique field computation	11 years ago
Michael Peter Christen	fb3dd56b02	fix for processing of noindex flag in http header	11 years ago
Michael Peter Christen	b0d941626f	fixed bugs in canonical, robots and title/description unique calculation	11 years ago
reger	d9472d043a	cleanup older unused classes	11 years ago
reger	665e12f88e	move startup time from old serverCore to switchboard (most used here) to make servercore eventually obsolete.	11 years ago
reger	336425912a	remove unused localSearchThread from SearchEvent	11 years ago
reger	32bd2a61c1	add local ip to AbstractRemoteHandler local hostname cache	11 years ago
Michael Peter Christen	f3a6b6e21e	fix for bad URL decoding	11 years ago
Michael Peter Christen	1092e798a5	fixed double content postprocessing	11 years ago
Michael Peter Christen	aee5b108e5	added linkScraperParser, a parser which ignores the text like the generic parser but extracts links like the htmlParser. This should be used for ASCII documents without known text format annotation like source code files or json documents. Probably also good for xml files without known schema.	11 years ago
reger	2b8cc5832c	fix seek error for 0 file size records file by add extra check for file size = 0 in cleanlast() - (http://mantis.tokeek.de/view.php?id=411)	11 years ago
reger	2ba394333f	fix Crawler HostQueue release of stackfile - close stackfile inputstream at end of ChunkIterator This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation)	11 years ago
reger	40133ba2d0	fix NPE in Condenser, discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference"	11 years ago
orbiter	59160984cc	timeline performance update	11 years ago
orbiter	54bea96e67	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	841cc77391	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	e09218129c	remove check for local solr. This check was made during a time when Solr was optional and another alternative metadata store was available. Since that store is now removed, Solr is always available (internally or externally)	11 years ago
orbiter	2073e69034	fix for long periods in timeline	11 years ago
reger	1f94df29e7	fix NPE in solr rss where snippet contains only the title text and adjusted xslt, for solr snippets (&hl=true) to decode the xml encoded html <b> tag by adding disable-output-escaping (still open item description may be double as dc: tag and rss.description tag)	11 years ago
Michael Peter Christen	09dcdb9b19	update to solr 4.9.0	11 years ago
Michael Peter Christen	1cd4b2e8be	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	8c52f0651b	refactoring of AccessTracker events & timeline fix	11 years ago
reger	431a5f9c4e	added test case for TextSnippet, removed obsolete/unused parameter and reference to MediaSnippet	11 years ago
Michael Peter Christen	5b94a257ce	no timeout for large reference collections	11 years ago
Michael Peter Christen	f5b817bac4	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	cb2c17d236	extract author and keywords in .doc and .ppt parser	11 years ago
reger	a5707cd2eb	enable proper Author navigator - author facet is based on omitted author_sxt field - adjust to make author nav available on exist of author field but keep using author_sxt to construct the facet (why!?) - add check for querymodifier author in searchevent	11 years ago
Michael Peter Christen	74206a10c7	refactoring	11 years ago
orbiter	fec673c9d1	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	4a66af716d	added apkParser stub (work in progress)	11 years ago
orbiter	c59da9fe7a	added access tracker log reader stub	11 years ago
reger	2d67f29244	adjust mergeDocument after parsing to - preserve charset and languages - fix merge of author	11 years ago
Michael Peter Christen	0d29b972cc	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	36e623d8bf	enhanced metadata enrichment for media file type search: - Web servers may now deliver YaCy-specific http header field with a title and keywords. The new http header fields are: X-YaCy-Media-Title - to be used for media (image, audio, video) titles X-YaCy-Media-Keywords - to be used for media (image, audio, video) keywords - both fields are written to document fields title and keywords and are searched also during image search. - to make the usage of arbitrary http header fields (including these new fields) possible in the /api/push_p.json servlet, a new POST argument is also introduced to push http header fields. The new POST attribute is named "responseHeader-X" (where X is the counter). It is allowed to use this attribute as multi-attribute several times, each can be filled with a http header line. - see /api/push_p.html for examples	11 years ago
Michael Peter Christen	49886fab08	enhanced debugging	11 years ago
Michael Peter Christen	b893c42a0f	bugfix for image search	11 years ago
Michael Peter Christen	c7995d3e2a	increased fixed limit for http POST request sizes to 100MB	11 years ago
reger	7847a93558	fix AbstractParser.singleList not adding null strings - prevents null titles in oo... parser (as detected by ParserTest) - correct ParserTest dc_description check (dc_description allowed to return 0 length array)	11 years ago
Michael Peter Christen	8acae852a0	write <em>-tagged texts also into the bold_txt field	11 years ago
reger	90c4576361	add a link to recrawl index entry to metadata html page - to allow manually renew index content for this url (e.g. in case it is a remote search result with metadata only) - use simply a QuickCrawlLink_p javascript snippet (minimalistic 1st solution)	11 years ago
Michael Peter Christen	2626c8f6db	using concurrency to do base64 encoding in file POST commands	11 years ago
Michael Peter Christen	e132689818	fixed and enhanced Base64 (en)coder (again)	11 years ago
Michael Peter Christen	2415e3db43	enhanced ASCII byte[] -> String conversion	11 years ago
Michael Peter Christen	4751ed974f	enhanced base64 encoding	11 years ago
Michael Peter Christen	e949071160	removed superfluous date method	11 years ago
Michael Peter Christen	501d55cd35	removed superfluous assert	11 years ago
orbiter	0bbb5040b8	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	9d5d86cd03	Added filter query options to the ranking servlet /RankingSolr_p.html. Filter queries are not actually related to ranking, but user requests have pointed out that specific boost queries to move results to the end of the result list are not sufficient. Such boost filters may be better executed as actual filter and therefore such a filter can now be statically applied to every search request. A typical use could be the expression "http_unique_b:true AND www_unique_b:true" which uses the recently introduced fields http_unique_b and www_unique_b which are true only for one of the alternatives with/without http(s) and with/without prefix 'www.' in host names.	11 years ago
Michael Peter Christen	d2151857f1	Added collection navigation: The collection field (can be filled i.e. in Crawl Start) can be used to add categories to YaCy index entries. The usage of that field was restricted to solr searches and post argument filters as implemented in commit `f7571386a3`. This commit extends collections to a full navigation option in the standard YaCy search interface. The field is not active by default but can be activated easily in the /ConfigSearchPage_p.html servlet (just check the 'Collection' facet field). Collections can now be used for (at least) two purposes: - to provide search tenants (through post argument collection) - to provide self-made category navigation Search requests may now have (independently from switched on or off collection facet) a "collection:<collection-name>" modifier attached; furthermore collection names may use disjunctions using the '|' pipe symbol. For example, this is a valid search request: www collection:user|proxy	11 years ago
Michael Peter Christen	74c249288a	added a push api to make it possible to upload files directly without crawling to the YaCy indexer. Files are uploaded using POST multipart requests; multiple file uploads are possible as well. Each file has attached the file date and mime type which is used to get the right parser for the submitted data. Also an url is submitted which is assigned to the document. The CrawlSwitchboard has a new option for default Crawl Profiles which are assigned dynamically from the new push interface.	11 years ago
Michael Peter Christen	f13c8aa7dd	re-implementation of file push option in the context of POST http requests. The internal representation of post-arguments is String and therefore not appropriate for byte[] object as submitted by file pushes. Therefore all pushed files are encoded to base64 _after_ uploading with an http form (you do not need to do that encoding yourself) to hand-over the byte[] as string in the post argument. Servlets which read such files must decode the base64 data to get the original byte[] array. This is considered as a temporary solution for file uploads and a proper implementation would need to consider all attributes as handed over as Objects with either String or byte[] Object instances. This would be a major code change and is not done at this time. The feature was submitted to realize a feature as pushed with the next commit.	11 years ago
Michael Peter Christen	ba6ffddefc	refactoring	11 years ago
reger	982601017e	crawling of filenames with + fails due to url decoding modified UTF8.decodeURL to apply x-www-form-urlencoded ( space -> + ) to the query part of the url only.	11 years ago
reger	3b559e7846	optimize pdfParser skip starting reader thread if all content already read	11 years ago
reger	09f73b790f	fix pdfParser not closed warning from pdfbox for encrypted pdf on exit due to missing permission to extract	11 years ago
reger	92d1604a31	Crawler hostbalancer does not delete finished queue files, use alternative delete to fight the symptom (and fix deletion of host dirs on startup) Root cause (which class holds a lock on .stack) not found. http://mantis.tokeek.de/view.php?id=404	11 years ago
Michael Peter Christen	0c324d735c	NPE fix for postprocessing without term index	11 years ago
Michael Peter Christen	922979aae1	added option to prefer http over https in unique-protocol ranking	11 years ago
Michael Peter Christen	b3b174e2b8	fixed webgraph postprocessing and status display in Crawler_p servlet	11 years ago
Michael Peter Christen	e6b28f5958	removed check on protocol for double content (user request)	11 years ago
reger	d8d318233e	fix logging settings - add missing .level - remove obsolete jena settings - set default level=INFO to prevent debug logging of not explicitly specified classes	11 years ago
Michael Peter Christen	698f053658	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	f23c4142e0	added option to configure a custom user agent within allip networks	11 years ago
reger	8e233e2eb4	- fix typo in Message_p (defaultpath) - use more existing switchboardconstants for getproperties - replace deprecated call defaultservlet	11 years ago
orbiter	d7d38f9135	made number of open files in crawler configurable and increased default maximum number of open files from 100 to 1000. This number can be changed with the attribute crawler.onDemandLimit	11 years ago
Michael Peter Christen	8ad41a882c	fixed several problems with postprocessing: - unique-postprocessing was destroying results from other postprocessings; removed cross-updates as they had been not necessary - unique-postprocessing did not restrict on same protocol - inefficient concurrent update cache was redesigned completely - increased limits for concurrent blocking queues to prevent early time-out	11 years ago
reger	ca5437dd50	fix crawl of file:// , also http://mantis.tokeek.de/view.php?id=149 local files can be crawled (intranet mode) url parsing fixed according to RFC 1738 (for unix and windows) for win like file:///c:/tmp or file://localhost/c:/tmp for linux like file:///tmp or file://localhost/tmp Host is ignored and path must be absolute	11 years ago
Michael Peter Christen	ff5b3ac84d	added new fields http_unique_b and www_unique_b which can be used for ranking to prefer urls containing a www subdomain or using the https protocol	11 years ago
sixcooler	5b1c4ef191	Monitoring and limit connection-count for Jetty	11 years ago
Michael Peter Christen	f0db501630	better handling of ranking parameters and new default values for date navigation which is done using ranking in solr.	11 years ago
Michael Peter Christen	53948da7d0	tried to make last_modified recognition smarter	11 years ago
Michael Peter Christen	2d03037965	'Last-Modified', not 'Last-modified' according to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html	11 years ago
Michael Peter Christen	3dc5fb0050	fix for operator precedence bug (cast binds stronger than bitwise AND) in peer hash hashing. This should not change anything if java casts long to int by masking with 0xFFFFFFFFL but you never know. The important thing is, that the hashCode() should not return numbers that have the same order as the hash code order because hashing of seeds is used to remove the order in some places.	11 years ago
Michael Peter Christen	6634b5b737	debug code for index distribution testing	11 years ago
orbiter	49e344e8d9	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	7705e36703	fix for latest generic warning fix	11 years ago
sixcooler	10326892a8	avoid errors from ConnectHandler, correction for #6d16fa9	11 years ago
orbiter	97983ba89f	fixed generics warnings for generic array instantiation that appeared after migration to Java 7	11 years ago
sixcooler	830057d788	lower Segment-size (hope to get Segments of 10GB) see: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=5216&p=30036#p30034	11 years ago
orbiter	c028ae9b09	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	e31493e139	"Use remote proxy for yacy" has no function, remove option and related config item see/fix bug http://mantis.tokeek.de/view.php?id=23 http://mantis.tokeek.de/view.php?id=189	11 years ago
orbiter	181784a5cb	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
reger	0587077d06	cleanup obsolete and not used serverswitch Authentify code as auth is mostly delegated to Jetty container.	11 years ago
orbiter	c9f66be20b	move unnecessary nested else out of condition	11 years ago
orbiter	0d8072aa99	removed warnings	11 years ago
orbiter	88f4af90da	removed warnings	11 years ago
orbiter	0f425e01ca	another circle computation enhancement	11 years ago
reger	a8d162810c	Exclude = from percent-encoding in MultiProtocolURL fix http://mantis.tokeek.de/view.php?id=185 and http://mantis.tokeek.de/view.php?id=280	11 years ago
reger	024f8e9b33	fix truncated urls containing "," addressing http://mantis.tokeek.de/view.php?id=58 Exclude comma from percent-encoding in MultiProtocolURL (see RFC 1738 2.2 and RFC 3986 2.2)	11 years ago
Michael Peter Christen	9112f0a2df	enhanced circle tool initialization	11 years ago
Michael Peter Christen	a1ac4c3b76	automatically clear graphics cache	11 years ago
Michael Peter Christen	505f58c79c	enhanced circle computation time and memory footprint	11 years ago
reger	cd8c0dbda9	assign serialVersionUID for proxyservlet, too.	11 years ago
reger	b300d7f4ce	set serialVersionUID on urlproxyservlet to skip compiler warning - remove commented out code	11 years ago
reger	e9060d31bd	update to Jetty 9 besides adjustments in code it makes the servlet settings in web.xml significant. This applies to solr, gsa and proxy servlet. There is no longer a default setup in code during init (as jetty 9 checks for double definition).	11 years ago
reger	1432a817dd	respect "index media" switched off in CrawlStartExpert.html fix http://mantis.tokeek.de/view.php?id=64	11 years ago
orbiter	39e1913585	next development step: migration to java 1.7 This includes also a small code change to test generic type inference, a java 1.7 feature	11 years ago
Michael Peter Christen	4e734815e8	enhanced snippets: remove lines which are identical to the title and choose longer versions if possible. Prefer the description part.	11 years ago
Michael Peter Christen	e84e07399a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	89f76da24b	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
sixcooler	390f03e041	do not check for segments-count on optimize: this is also done in Solr and our getSegmentsCount() does not return up-to-date values	11 years ago
reger	8a7c68e4c7	content of surrogates/out never accessed (remove) After import the content is never accessed but may take up a lot of disk space, also the getLoadedOAIServer (which lists the files in surrogate out) is not used. Making the surrogate.out obsolete. Removed keeping of xmls after import.	11 years ago
sixcooler	b8cee9b7d8	remove tables from tabletracker on close to avoid lots of dead entries in /PerformanceMemory_p.html	11 years ago
reger	1600414450	fix NPE on continuing crawls after YaCy restart (Agent is then null)	11 years ago
Michael Peter Christen	229f2248b8	added configuration option for maximum load and minimum ram for postprocessing	11 years ago
orbiter	f15c832587	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Marc Nause	c97da1a0d8	First draft of a blacklist API.	11 years ago
Michael Peter Christen	d4f65833a1	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	c1c1be8f02	fix for slow crawling and better logging in balancer	11 years ago
Michael Peter Christen	3acf416335	npe fix	11 years ago
reger	2eb7682772	add html5 audio/video <source> tag to html content scraper - <source src=.. type=..> tag content is added to embed collection	11 years ago
reger	0b6db04e40	fix contentscraper img height/width parsing prevent numberformat exception on common "100px" property - include in test case	11 years ago
reger	ffc5b75c73	optimize and fix lat / lon assignment	11 years ago
reger	9313447de2	reimplement tighter lat/lon calc in URIMetadataNode from old MetadataRow, considering http://mantis.tokeek.de/view.php?id=272	11 years ago
reger	d812f80784	add exit proxy link to UrlProxy on proxied pages a link to exit proxy is added to top of page. Link text can be configured in web.xml init-parameter (see default/web.xml). If missing no link is displayed.	11 years ago
reger	78d08998db	throw MalformedURLException on unknown protocol on other than the supported http https ftp file smb \\ mailto	11 years ago
reger	bb8181b2be	fix: resolve url without path but searchpart e.g. http://yacy.net?q=test was resolved as host "yacy.net?q=test" now host="yacy.net" path="/" fixes http://mantis.tokeek.de/view.php?id=47 added test case for getHost	11 years ago
orbiter	a3542f29b4	npe fix	11 years ago
orbiter	c48d2a2a02	npe fix	11 years ago
reger	121d25be38	recover sax fatal error on OAI-PMH import of xml with entity error this allows to continue loading next resumptionToken even if import file caused sax parser error fix http://mantis.tokeek.de/view.php?id=63	11 years ago
reger	81dc2aa536	add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links)	11 years ago
orbiter	2fd8a0ead6	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
orbiter	8e5ce7cd51	fixed a situation where finished crawls had not been detected.	11 years ago
orbiter	2f63bd0261	enhanced Host Balancer strategy: fair round robin	11 years ago
orbiter	0c88a32c36	do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions.	11 years ago
orbiter	8e04030596	in case of short memory, do not cut down robinson peers to 1, just reduce by 50%	11 years ago
reger	86f6975edc	exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case	11 years ago
orbiter	ccb1864d55	catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear)	11 years ago
orbiter	4ee4ba1576	fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i	11 years ago
orbiter	12ba890205	removed warnings	11 years ago
reger	d51f9cc863	add custom Jetty errorhandler to provide custom error page footer line - remove redundant mime check in UrlProxyServlet	11 years ago
reger	c193a02023	defer creation of new ArrayList after possible early return (to skip not used object allocation)	11 years ago
reger	727dfb5875	refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex	11 years ago
reger	79e7947442	- remove empty http0_9 status text array and unused default_charset = ISO-8859-1	11 years ago
reger	2dabe2009d	- remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http	11 years ago
Michael Peter Christen	5746aae3db	add canonical links to the same crawldepth, not the next crawldepth	11 years ago
Michael Peter Christen	74ab5ef9fa	increased runtime for postprocessing query job	11 years ago
Michael Peter Christen	8b32dd5f9e	special strategy for balancer: do not remove targets with zero wait time from the queue	11 years ago
Michael Peter Christen	9c6228d948	fix for deadlocks in crawler	11 years ago
Michael Peter Christen	10cf8215bd	added crawl depth for failed documents	11 years ago
Michael Peter Christen	7fefebaeca	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	c2f62e783f	- better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation	11 years ago
Michael Peter Christen	06afb568e2	new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files.	11 years ago
Michael Peter Christen	1aea01fe5b	fix for Table in case that requested file does not exist and paths also do not exist	11 years ago
reger	710054bb37	implement gzip input handling directly in defaultservlet (making reference to legacy httpdemon obsolete)	11 years ago
Michael Peter Christen	9a5ab4e2c1	removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy.	11 years ago
Michael Peter Christen	da86f150ab	- added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create 10.000s of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such on-demand file reader shall prevent that the number of file pointers is over the system limit, which is usually about 10.000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services.	11 years ago
Michael Peter Christen	075b6f9278	refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer.	11 years ago
Michael Peter Christen	8470dfe3f8	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	46016fa153	autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) !!! left existing update blacklist setting untouched !!! (existing installation wanting autoupdate for 1.71 need to change blacklist in ConfigUpdate_p.html) - moved old blacklist patch to migration.java	11 years ago
Michael Peter Christen	8aeef73d49	fix for virtual root nodes	11 years ago
Michael Peter Christen	7c7fbb9818	find depth-matches also for edge targets	11 years ago
Michael Peter Christen	dd12dd392f	introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph.	11 years ago
Michael Peter Christen	6ea8bb7348	using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size	11 years ago
Michael Peter Christen	b21c208b4d	enhanced hashcode computation for MultiProtocolURL	11 years ago
Michael Peter Christen	ce1d1b2fa0	fix for maximum tag length in parser	11 years ago
Michael Peter Christen	17e0956312	refactoring of SystemLoad calls (only one backend tool)	11 years ago
Michael Peter Christen	a37d067692	refactoring	11 years ago
orbiter	95780eed32	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	67beef657f	strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references.	11 years ago
Michael Peter Christen	6bd8c6f195	fix for wrong status codes of error pages	11 years ago
Michael Peter Christen	9e503b3376	also delete the robots.txt file from the cache when a new crawl is started	11 years ago
orbiter	67501c9dda	Merge branch 'master' of git@gitorious.org:yacy/rc1.git	11 years ago
Michael Peter Christen	1c21b3256d	fix for robots.txt handling: delete old entry before starting a new crawl.	11 years ago
orbiter	c250fac9f4	linkstructure refactoring to get more options for clickdepth analysis	11 years ago
Michael Peter Christen	8068e68474	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	bd886054cb	new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information	11 years ago
reger	f326a67561	fix: typo in default charset in metadata2solr update pom and NB build to Solr 4.7.1 libs	11 years ago
Michael Peter Christen	df138084c0	do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation)	11 years ago
Michael Peter Christen	ebd44a7080	replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47	11 years ago
Michael Peter Christen	734778c0c8	fixed a time-out problem in the default servlet which is also a logging problem because the error log showed the wrong reason (file not found) instead the actual reason (time-out).	11 years ago
Michael Peter Christen	466d90ad42	fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrency environments.	11 years ago
Michael Peter Christen	e8ddd415a8	enhanced the new link structure graph	11 years ago
Michael Peter Christen	926d28dd3f	fixed a bug which prevented crawl starts after a network switch	11 years ago
Michael Peter Christen	3ce8eff21b	another fix for inbound/outbound detection	11 years ago
Michael Peter Christen	d4b5c457e4	NPE fix	11 years ago
Michael Peter Christen	36a66b0704	fix for parsing of numeric value in case that boolean values are given	11 years ago
orbiter	41730c8048	better logging in template engine: shows filename of servlets where errors in templates occur	11 years ago
orbiter	3c1274057d	fixed thread dump in case of wrong seeds	11 years ago
orbiter	18f9c40302	moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases)	11 years ago
orbiter	de95e5e524	reduced search activity corona strength in network image	11 years ago
reger	da413af664	move baseurl after parsing orig source in urlproxyservlet to calculate absolute href links for rewrite from unmodified source.	11 years ago
reger	af6ad20728	fix: remove obsolete ref to yacy.home (use Switchboard instead)	11 years ago
Michael Peter Christen	74ab094587	fix for solr query size; too many documents had been retrieved in case that less than _pagesize_ had been requested.	11 years ago
Michael Peter Christen	c64c10ef00	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	48fbfa60c1	bugfix to inbound/outbound identification	11 years ago
reger	227c42bc96	eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode.	11 years ago
Michael Peter Christen	cca851a417	introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document.	11 years ago
orbiter	b1ba764d81	fix for first start options and added german translation for popup texts	11 years ago
orbiter	429a874222	- added COLS field in GSA response (non-gsa standard by customer request) - updated document link in GSA response writer	11 years ago
Michael Peter Christen	1b9ec9a1c5	- added popover to p2p/stealth mode button to explain the peer mode and privacy issues. - added popover to first-time use case to explain that specific servlets are only visible after customization and/or crawl starts	11 years ago
Michael Peter Christen	62a36fa584	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	c9f92abddc	fix: application link count (URIMetadataNode)	11 years ago
Michael Peter Christen	a267c46e1a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	5b83887da8	npe fix	11 years ago
Michael Peter Christen	63c9fcf3e0	free configuration of postprocessing clickdepth maximum depth and time	11 years ago
Michael Peter Christen	39b641d6cd	added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts.	11 years ago
sixcooler	f06775850f	fix receiving DHT / parse multipart + another close to fix possible resource leak warning	11 years ago
reger	49e76a1c55	make use of detected charset in htmlParser if none is given.	11 years ago
reger	e11504309f	adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html)	11 years ago
reger	b12200cafe	alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules - use JSoup parser for selective rewrite of html body <a href= links only, instead of regex which rewrites also header href/src links - this improves display of pages which use header <base> tag - tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer Disadvantage: scripting links will drop out of proxy Setting of the servlet through web.xml exclusively (in case one would like to quickly switch back to the YaCyProxyServlet, leaving the existing code of YaCyProxyServlet untouched available)	11 years ago
reger	2953ebe701	fix: port in local target address & button style	11 years ago
Michael Peter Christen	fda591695c	fixed visibility of custom icon	11 years ago
Michael Peter Christen	a9b9950d7f	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
Michael Peter Christen	b488f33975	added close to fix possible resource leak warning	11 years ago
Michael Peter Christen	56710ecb26	prevent opening of new files as that could be a cause for the latest too-many-open-files exception. The old file is just truncated if the table is cleaned.	11 years ago
Michael Peter Christen	8b44fcf0f4	added missing @Override annotation	11 years ago
reger	d7055904a6	fix: proxyservlet path header setting	11 years ago
Michael Peter Christen	e515dd460d	added linkscount_i and linksnofollowcount_i to the default solr schema	11 years ago
Michael Peter Christen	1a764135be	one more Thread Dump fix for new bootstrap css style	11 years ago
Michael Peter Christen	bb21d825f9	fix for thread dump line spacing	11 years ago
Michael Peter Christen	cbdfef7ce1	changed protocol facet to show also all other counts if one facet is selected	11 years ago
reger	b9056ef2db	remove unused private header entries (HeaderFramework) X_YACY_ORIGINAL_REQUEST_LINE X_YACY_KEEP_ALIVE_REQUEST_COUNT CONNECTION_PROP_REQUESTLINE	11 years ago
sixcooler	6d16fa993d	make transparent proxy handle https-connections: the implemented handle for connect did not work for me - so lets try the connectHandler	11 years ago
Michael Peter Christen	61ad194065	fix for source and target clickdepth in webgraph index	11 years ago
Marc Nause	809b4e1fd9	Team added support for URLs with unicode characters in host part to blacklist. Punycode is used to handle unicode characters.	11 years ago
reger	b126b9ba17	add some InputFileStream close at end of reads to make sure file is released	11 years ago
reger	ca7444dbdf	limit filetype nav to known extension also on image/media search - on text search we limit filetype nav already to known extension, apply filter to image search	11 years ago
reger	651d057e93	surrogate import translate dc:language 3-char codes OAI records often use 3-char language codes, start converting some 3-char lang's to the internal ISO639-1 2-char code	11 years ago
orbiter	22618e3ba2	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
orbiter	01989f6af9	restrict write buffer size to a limit	11 years ago
Michael Peter Christen	d1091e79f8	- added stealth button to navigation menu - more fixes to progress bar	11 years ago
reger	c297de5145	remove check for unused virtual path /currentyacypeer/ - del jqueryheader.template (not used)	11 years ago
orbiter	3c8d6e1eee	added adminAccount switch to ConfigAccounts_p servlet to switch on protection of all pages; some refactoring as well	11 years ago
orbiter	7d24bcb98d	added flag to require that all web pages, even such without a "_p" extension require authorization. (default off)	11 years ago
Michael Peter Christen	7a6658abec	removed synchronization in embedded solr connection (that was probably a mistake?)	11 years ago
Michael Peter Christen	a7d4379ef9	fixed shutdown of solr cores in case that more than one local core is to be closed (this happens if webgraph is enabled and the index is dumped using /IndexControlURLs_p.html)	11 years ago
Michael Peter Christen	453bfd0f17	removed unused variables and warnings	11 years ago
Michael Peter Christen	05655d98df	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	11 years ago
reger	9f02d2c47b	fix: remove link to triplestore in Vocabulary_p (triplestore no longer exists) - should be investigated in more detail to look for additional implications Remove "yacyaction" from proxyservlet as it was only needed for removed interaction routines.	11 years ago
reger	81a846ec33	fix: set YaCy CONNECTION_PROP_HOST Header in ProxyServlet to host incl. port	11 years ago
reger	251be9ecfa	remove unused ProxySettings ref. from loader clean unused whois test code	11 years ago
reger	82dc815af9	cleanup: remove unrelated and unused code	11 years ago
Michael Peter Christen	85a427ec54	support for multiple sitemaps in robots.txt	11 years ago
reger	a373fb717d	remove more unused from legacy server.http - triggerOnlineAction not used - useTemplateCache not used	11 years ago
reger	749d020aeb	remove redundant url string manipulation in HTTPDProxyHandler (still used by ProxyServlet)	11 years ago
reger	612294cf84	use servletPath in ProxyServlet instead of fixed name to allow servlet-mapping via web.xml	11 years ago
reger	1d01672bd3	fix DCEntry.getIdentifier on successful url parameter	11 years ago


8012 Commits (fbbfeeb31397588b0b37b482c28a2b3da5cf2f22)