- the list of urls is entered in the expert crawl start in a textfield;
the one-line input field was replaced with a text box
- start urls can also be given in one single line where the urls are
separated by a '|'-character
- as an effect, the crawl profile cannot carry a single start url for
identificaton because it is possible to have more. Therefore the url was
removed from the crawl profile
- this affect all servlets which display a crawl profile: removed the
url field from all there servlets
- to work consistently with several start urls and the other crawl
starts which computed crawl start url lists from sitelists or sitemaps,
the crawl start servlet was restructured completely
- new rules for must-match patterns were created to make it possible
that site crawl starts also work with several crawl starts at once
changes of API schedule row data changed form input form to unique field names
using row pk.
Fix for issue 96 http://bugs.yacy.net/view.php?id=96
IE9-64bit doesn't interprete iframe with align parameter as desired
misaligns following content (in CrawlProfileEditor_p.html)
- nobody understand the auto-dom filter without a lenghtly introduction about the function of a crawler
- nobody ever used the auto-dom filter other than with a crawl depth of 1
- the auto-dom filter was buggy since the filter did not survive a restart and then a search index contained waste
- the function of the auto-dom filter was in fact to just load a link list from the given start url and then start separate crawls for all these urls restricted by their domain
- the new Site Link-List option shows the target urls in real-time during input of the start url (like the robots check) and gives a transparent feed-back what it does before it can be used
- the new option also fits into the easy site-crawl start menu
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7213 6c8d7289-2bf4-0310-a012-ef5d649a1542
- every process that is monitored with the API Steering interface can now be scheduled!
- added input methods in Steering interface to set a scheduling time
- added a view on the steering api that shows only crawl jobs inside the Crawl Profile servlet
- added a scheduling call process in the cleanup process handler that triggers the scheduled processes
This causes that the cleanup now also looks for scheduled processes. Such processes are therefore not executed at
the same time as given in the target execution time but they will be executed within the cleanup process time window.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7050 6c8d7289-2bf4-0310-a012-ef5d649a1542
- moved all index generation servlets to it's own main menu item, including proxy indexing
- removed external index import because this operation is not recommended any more. Joining an index can simply be done by moving the index files from one peer to the other peer; they will be merged automatically
- fix to prevent endless loops when disconnecting http sessions
- fix to prevent application of bad blacklist entries that can cause a 'Dangling meta character' exception
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6558 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed never-used secondary crawl depth
- added a must-not-match filter that can be used to exclude urls from a crawl
- added stub for crawl tags which will be used to identify search results that had been produced from specific crawls
please update the yacybar: replace property name 'crawlFilter' with 'mustmatch'.
Additionally, a new parameter named 'mustnotmatch' can be used, which should be by default the empty sring (match-never)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5342 6c8d7289-2bf4-0310-a012-ef5d649a1542
the history distinguishes between different users and identifies them by their ip
a history is only shown to the user who submitted the search
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4510 6c8d7289-2bf4-0310-a012-ef5d649a1542
- the index administration now uses the same code base for url selection and collection
as the search interface. The index administration is therefore a good test environment for
ranking order control
- removed old postsorting-algorithms, will be replaced with new one
- fixed many bugs occurred before during ranking; especially the contraint filtering method
removed too many links
- fixed media search flags; had been attached to too many urls. The effect should be a better
pre-sorting before media load within snippet fetch
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4223 6c8d7289-2bf4-0310-a012-ef5d649a1542
- removed web structure picture from indexing menu and grouped it together with htcache monitor
- added a database for terminated crawls, when a crawl is finished it is automatically moved to the new database
- extended crawl profile edit servlet, shows now also terminated crawls
- option that was used to delete profiles is now redesigned to a function that moves the current crawl to the terminated crawls and removes all urls from the current queues!
- fixed here and there problems with indexing queues
- enhances indexing speed by changing cache flush sizes.
- changed behaviour of crawl result servlet: the list of crawled urls is shown if there is one, othevise the overview window is shown
attention: the new profile databases are not compatible with the old one. current crawls will be lost! the web index is not touched.
next steps: the database of terminated crawls can be used to start with them a new crawl. This is useful if one wants to re-crawl specific pages and wants to use a old crawl profile.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@4113 6c8d7289-2bf4-0310-a012-ef5d649a1542