174 Commits (bbcd9441bc9108802231bf452e547f96825b9253)

Author SHA1 Message Date
orbiter 9c681cc00d added segment sizes, postprocessing status and cpu load to crawler
11 years ago
Roland Haeder 841a28ae76 Added 'final' for all exception blocks as this helps the Java compiler
11 years ago
Michael Peter Christen 89c0aa0e74 added collection_sxt to error documents
11 years ago
Michael Peter Christen bcc623a843 refactoring of load_delay: this is a matter of client identification
12 years ago
Michael Peter Christen 5878c1d599 - refactoring of log to ConcurrentLog:
12 years ago
Michael Peter Christen 57ffdfad4c added a crawl option to obey html-meta-robots-noindex. This is on by
12 years ago
Michael Peter Christen f1c5338210 preparation for greedy crawl profiles and refactoring
12 years ago
Michael Peter Christen 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and
12 years ago
Michael Peter Christen f93501e6e0 nice crawl name if crawl is started with file:// (was: null)
12 years ago
Michael Peter Christen b24d1d18e4 removed synchronization and concurrency in Fulltext class, concurrent
12 years ago
Michael Peter Christen e26bdd4a52 fixes to deletion methods (removed unnecessary concurrency and added
12 years ago
Michael Peter Christen cca19d94d4 re-declared some fields to be of type string rather than text which
12 years ago
Michael Peter Christen 25499eead5 - added a new field for the regular expression in crawl start
12 years ago
orbiter 2c3b024196 if the crawl was paused (automatically), show the reason for pausing in
12 years ago
Michael Peter Christen 788288eb9e added the generation of 50 (!!) new solr field in the core 'webgraph'.
12 years ago
Michael Peter Christen 91a0401d59 introduced a second core named 'webgraph'. This core will hold the link
12 years ago
Michael Peter Christen 0b6566a389 optimizations when starting large crawl requests with many start urls in
12 years ago
Michael Peter Christen be27567b53 allow more links when starting a crawl by file
12 years ago
Michael Peter Christen 0fe7b6fd3b migrated the index export methods from the old metadata to solr. Now
12 years ago
Michael Peter Christen 4735bd47f4 - changed solr commit call and added an optimize option. Since Solr
12 years ago
Michael Peter Christen fb0fa9a102 - fixed 'delete from subpath' during crawl start which deleted nothing;
12 years ago
Michael Peter Christen eca68fa197 added debug code to crawler monitor
12 years ago
Michael Peter Christen 5fd3b93661 added deletion of hosts during crawl start if deleteold option was given
12 years ago
orbiter b55ea2197f - redesign of crawl start servlet
12 years ago
orbiter 1c66de4bd4 - removed scheduled crawling options in crawl start because it is
12 years ago
Michael Peter Christen 6244b084cd fixed wrong order of result count values
12 years ago
Michael Peter Christen 15d1460b40 added information about the reason of pausing of crawls
12 years ago
Michael Peter Christen 2371ef031c added solr faceted search support to YaCy search results
12 years ago
Michael Peter Christen 791e1dcfdf when a new crawl is started, delete all entries about error-urls for
12 years ago
Michael Peter Christen 5e77801aac update to web interface structure
12 years ago
orbiter 354ef8000d - added 'deleteold' option to crawler which causes that documents are
12 years ago
Michael Peter Christen f8f05ecba7 - added a delete button in host browser to delete a complete subpath
12 years ago
Michael Peter Christen ac9540dfb6 removed options for stopwords which are not used
12 years ago
Michael Peter Christen 85ca07b90e when a new crawl is started, an equal crawl, if still running, is
12 years ago
Michael Peter Christen ae6feb5610 showing the web structure graph as animation in the crawl monitor
12 years ago
Michael Peter Christen 21fe8339b4 - enhanced generation of url objects
12 years ago
Michael Peter Christen 5f0ab25382 removed the option to prevent removal of & parts inside of the
12 years ago
Michael Peter Christen 53789555b9 fix for crawl start filter
12 years ago
Michael Peter Christen abebb3b124 added a crawl start checker which makes a simple analysis on the list of
12 years ago
orbiter ae246c30c3 fixed interpretation of directDocByURL attribute during crawl start
12 years ago
sixcooler c65b576a6f added filename for missing crawlname when crawling from file
12 years ago
Michael Peter Christen 1533bfd63b refactoring
12 years ago
Michael Peter Christen 00c1c777fa refactoring
12 years ago
orbiter 60b1e23f05 added new crawl options:
12 years ago
Michael Peter Christen 6ec02deec6 added new crawl attributes in crawl profile (not active yet)
12 years ago
Michael Peter Christen a13e5153ac - added the possibility to have not one but a list of crawl start urls
12 years ago
Michael Peter Christen 9644c186a4 added search functionality to ViewFile.html servlet
12 years ago
Michael Peter Christen b2b516cc3e added a collection attribute to crawls and searches:
12 years ago
Michael Peter Christen 0cab06c47c refactoring
12 years ago
Michael Peter Christen 24d9db1613 snippet retrieval loading processes may use a smaller minimum load time
12 years ago
Michael Peter Christen 1687737771 Abstraction of HandleMap and HandleSet
12 years ago
Michael Peter Christen e3aa05b9dd added creation of subpath pattern when crawl start is 'from file'
13 years ago
orbiter 0cbda0b2b8 - replaced all length() == 0 and size() == 0 with isEmpty()
13 years ago
Michael Peter Christen 7c1ba99755 removed more unused method parameters
13 years ago
Michael Peter Christen 0301aba1e9 removed unused method parameters
13 years ago
Michael Peter Christen d3964253ae - added @SuppressWarnings to unused servlet method parameters
13 years ago
Michael Peter Christen 276a66a793 Adding a limit of 1000 links that a parser shall store during indexing.
13 years ago
Michael Peter Christen 1825f165b8 better integration of blacklist according to use case
13 years ago
Michael Peter Christen 03280fb161 removed segments-concept and the Segments class:
13 years ago
Michael Peter Christen 9116013c64 - allow lazy initialization of solr value (if using 'lazy', then no
13 years ago
Michael Peter Christen 77f795756c fixing redirects and status codes: storing of status code in
13 years ago
Michael Peter Christen d7eb18cdf2 accept also file names beginning with "file://" for crawl start from
13 years ago
Michael Peter Christen 16b21f7a5b Added more steering in Crawler_p.html interface
13 years ago
Michael Peter Christen 19efbf1b0f - apply directDocByURL to NOLOAD Queue
13 years ago
Michael Peter Christen ef5192f8c9 using the generic document parser for crawl starts instead of the html
13 years ago
Michael Peter Christen 992dbdf4bb added noload statistic to servlets
13 years ago
orbiter 11729061f2 added an option in the bookmark import process to put everything into the crawler
13 years ago
orbiter 5a55397f99 some last-minute performance hacks
13 years ago
orbiter da55a359e9 addon to http://bugs.yacy.net/view.php?id=72
13 years ago
apfelmaennchen 564374d1fe - included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
13 years ago
orbiter d449547023 fix for http://bugs.yacy.net/view.php?id=72
13 years ago
orbiter c93f10417a add a bookmark automatically each time a new crawl is started
13 years ago
orbiter e4a82ddd8b produce a bookmark entry from every crawl start. these bookmarks are always private.
13 years ago
orbiter 42425c8003 fixed directDocByURL (has now effect if switched off)
13 years ago
orbiter a7df70221e refactoring
13 years ago
orbiter cf4fd525ee added directDocByURL attribute in crawl profile
13 years ago
orbiter b250e6466d implemented crawl restrictions for IP pattern and country lists
13 years ago
orbiter 5ad7f9612b added crawl settings for three new filters for each crawl:
13 years ago
orbiter d2ea250d99 refactoring:
13 years ago
orbiter 115abc8917 - more attributes for search progress bar
14 years ago
orbiter 10e2f588f8 - enhanced ybr ranking computation
14 years ago
orbiter 6fa439c82b - refactoring of robots
14 years ago
orbiter b77b8cac0c - enhanced html parser: recognized much more details in the content
14 years ago
orbiter 3d5104d357 - fixed a bug in crawl start with file name (npe in new url)
14 years ago
orbiter 156cf02703 - added an index constraint 'has location' to the condenser
14 years ago
orbiter 43e1660512 fix/enhancement in Crawler: do not generate domain match pattern if crawl depth is 0
14 years ago
low012 2861d0888a *) simplified code *) fixed potential NumberFormatExceptions
14 years ago
orbiter 7962d35425 - removed file upload function in crawl start and replaced it with an input field for a file path where the crawl start file is loaded. This was necessary to support the API steering for file crawl starts, for two reasons:
14 years ago
orbiter 4588b5a291 - fixed document number limitation for crawls that restrict the number of documents per domain
14 years ago
low012 ae10ed5613 *) added a Set to which filter elements are written before mustmatch-filter is created to avoid huge lists of double elements in mustmatch-filter when starting a crawl from a "Link-List of URL" on CrawlStartSite_p.html
14 years ago
orbiter c93f4dda72 - cleaned up yacy news
14 years ago
orbiter 0769f4caa6 added search suggestions for interactive search: is only shown if there are no search results
14 years ago
orbiter 58b59f9bc8 - a collection of bug fixes and some redesign of the Scanner class
14 years ago
orbiter a563b05b60 enhanced crawler:
14 years ago
orbiter c36da90261 added a very fast ftp file list generator to site crawler:
14 years ago
low012 e7552bd719 *) cleaning up the code a little bit
14 years ago
low012 38fdf43587 *) renamed classes according to standard Java coding conventions
14 years ago
f1ori 7d8de34778 * add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null)
14 years ago
orbiter fcd40cd30f - disabled domZones (buggy, must think about better solution)
14 years ago
orbiter c3bf17a3a1 fixed must-match filter for smb crawling
14 years ago