76 Commits (49cab2b85f58542aeac6b8b4f19b3df595a87bbf)

Author SHA1 Message Date
Michael Peter Christen 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD
13 years ago
Michael Peter Christen 33d1062c79 refactoring: the cache belongs to the crawler
13 years ago
reger a95f645a61 Bugfix class repository.Loaddispatcher fixed download file limit of 10000
13 years ago
Michael Peter Christen ef5192f8c9 using the generic document parser for crawl starts instead of the html
13 years ago
Marek Otahal f40efb39af Blacklist loadList() remove duplicates by using Set
13 years ago
Michael Christen eebc02f5c1 fix
13 years ago
Michael Christen 216a287a85 Merge commit '6d4e08ed06c5cd28c45981b2ebe31c7f7ec6fd83' into quix0r
13 years ago
Michael Christen 361146dd7a better error handling for file loader
13 years ago
Roland 'Quix0r' Haeder 6d4e08ed06 Rewrote filesize() to (hopefully) avoid a NPE, rewrote Blacklist class to concurrent classes to avoid a CME
13 years ago
Roland Haeder 319fd1f4aa Concurrent access can happen on the blacklist (with the recently introduced blacklist check in media snippet computation)
13 years ago
Roland 'Quix0r' Haeder a3083d13bf Blacklist checks are now always turned on; in media searches (e.g. image search), images matching blacklist entries are no longer shown to the user
13 years ago
orbiter e22f8497c9 - tested the ARC methods
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
orbiter 49e5ca579f added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is the default now), then all embedded image, audio and video links from all parsed documents are added to the search index as individual documents. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a Solr index, if this is also enabled.
13 years ago
orbiter 1c007188ad bugfixes in html parser
13 years ago
sixcooler 5f8a5ca32d - not doing merge-jobs while short on memory
13 years ago
orbiter 2c58af6874 - added a short memory status simulation mode
13 years ago
sixcooler 59b767eebd stop loading via http at a defined maximum of bytes, even if the size is unknown before loading
13 years ago
low012 c7b95e8c81 *) Invalid crawl profiles (containing invalid mustmatch/mustnotmatch filters) will be moved from active crawls to invalid crawls (new file: DATA/INDEX/freeworld/QUEUES/crawlProfilesInvalid.heap). This file cannot be edited yet, but it should be easy to extend the CrawlProfileEditor accordingly.
14 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 0c1b29f3c9 - applied many small performance hacks
14 years ago
orbiter 4bea3f9714 hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources:
14 years ago
orbiter d8e934c085 better abstraction of http client identification
14 years ago
orbiter 96c32e87b0 fixes to crawler and new user-agent crawl-delay handling
14 years ago
orbiter b1a8d0c020 enhancements to web cache and less strict caching rules
14 years ago
orbiter 30aed9824a moved getBytes() to UTF8.getBytes() to use a default String encoding
14 years ago
low012 3b40b98256 *) set SVN properties
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
low012 74b22dfa24 *) fixed bug which affected blacklist entries which consisted of domain _and_ path parts
14 years ago
f1ori a321c7673d * adminAccountForLocalhost only for localhost
14 years ago
orbiter 6c1b14c8e1 - more control in access tracker: count number of returned search results (not only info how much is in the index)
14 years ago
f1ori 9d2159582f * fix system update if urls are in blacklist (for example for very general blacklists like *.de)
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
low012 9b3fae9496 *) cleaning up the code a little bit
14 years ago
orbiter 321eb012fe removed two warnings and reverted one change
14 years ago
low012 eb79b952ef *) cleaner code
14 years ago
f1ori d62e449a11 * fix FilterEngine, forgot comparison operator
14 years ago
f1ori def4253555 * add option to network definition to provide a domainlist (syntax like in blacklists)
14 years ago
orbiter aacf572a26 - enhancements for search speed
14 years ago
orbiter d2fd93135c - moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed
14 years ago
f1ori 8fe1102452 fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426
14 years ago
orbiter 24502fe3de performance hacks
14 years ago
orbiter ffaa9a1c51 avoiding double-loading of the same resource from the web in case a second attempt to load the resource is started while the first attempt is still loading the content from the web. This will delay the second attempt until the first attempt has finished, with the possible result that the second attempt reads only from the web cache, not from the web.
14 years ago
orbiter ae07e11bc5 enhanced image search result display: concurrent loading of images before they are displayed
14 years ago
orbiter 65eaf30f77 redesign of crawl profiles data structure. target will be:
14 years ago
orbiter 3197ca42ed preparations to move the HTCache into cora:
14 years ago
orbiter 22dbbcfa56 better (and corrected) recognition of intranet and internet addresses. This corrects the isLocal property that is used by network definitions to restrict index ranges to local and global addresses. Address locations (intranet or internet) had been partly identified by the top-level domain of the host address. Since intranet addresses can also be addressed using a host name that is in a country domain, it is necessary to do a DNS resolution for each check. The check is supported by a local DNS cache, so the intranet/internet check should not affect network traffic too much. To ensure that the cache works properly, the cache class was upgraded to better concurrency data structures.
15 years ago
orbiter b6fb239e74 redesign of parser interface:
15 years ago
orbiter 777195e8d1 more abstraction for access of LoaderDispatcher and cache
15 years ago