Michael Peter Christen
enhanced exists()-method for solr; should reduce a lot of IO during DHT
target selection
12 years ago
Michael Peter Christen
added a Boost class which stores solr query boost values. The class can
be configured using the yacy.init file. The boost information is taken
from the configuration each time when a query to solr is done.
12 years ago
Michael Peter Christen
added more logging to get info which url causes performance problems
12 years ago
fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*')
12 years ago
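A minimal sketch of the blacklist fix above, written in Python for illustration only (YaCy itself is Java). A bare `*` is a quantifier with nothing to repeat, so compiling it as a regular expression fails; extending it to `.*` makes it valid. The function names are hypothetical and are not taken from the YaCy code base.

```python
import re


def normalize_blacklist_path(path: str) -> str:
    """Turn the bare wildcard '*' into the valid regex '.*' (hypothetical helper)."""
    return ".*" if path == "*" else path


def compile_blacklist_path(path: str) -> re.Pattern:
    """Compile a blacklist path after normalizing the bare wildcard."""
    return re.compile(normalize_blacklist_path(path))
```

Calling `re.compile("*")` directly raises `re.error`, while `compile_blacklist_path("*")` returns a pattern that matches any path.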
prevent Solr "version conflict" on update by setting the Solr "_version_" field to 0 (= no version check)
12 years ago
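The effect of setting "_version_" to 0 can be illustrated with a small model of Solr's documented optimistic-concurrency rules. This is a Python sketch of those rules, not Solr or YaCy code; the function name and the use of `None` for a missing document are assumptions made for the example.

```python
from typing import Optional


def version_check_passes(requested: int, stored: Optional[int]) -> bool:
    """Model of the _version_ rules; `stored` is None if the document does not exist."""
    if requested == 0:
        return True                  # 0: no version check at all
    if requested < 0:
        return stored is None        # negative: document must not exist
    if requested == 1:
        return stored is not None    # 1: document must exist, any version
    return stored == requested       # >1: stored version must match exactly
```

With a stale value such as `version_check_passes(999, 12345)` the update would be rejected as a version conflict, whereas `version_check_passes(0, 12345)` always passes.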
Michael Peter Christen
improvements in GSA result writer
12 years ago
Michael Peter Christen
replaced more split and replaceAll calls that lacked pattern
pre-compilation with pre-compiled patterns
12 years ago
Michael Peter Christen
using more pre-compiled patterns for split methods
12 years ago
Michael Peter Christen
enhanced search result processing behavior
- query less at one time; query more often
- in between the small queries, evaluate results
- remove fields from search results which are not needed
12 years ago
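The three points of the search result processing change above can be sketched roughly as follows. This is an illustrative Python sketch, not YaCy's implementation: `fetch_page`, `accept`, and the parameter names are hypothetical, and a real client would ask the index for only the needed fields instead of projecting them afterwards.

```python
from typing import Callable, Dict, List, Sequence


def incremental_search(
    fetch_page: Callable[[int, int], List[Dict]],
    wanted: int,
    page_size: int = 10,
    fields: Sequence[str] = ("url",),
    accept: Callable[[Dict], bool] = lambda doc: True,
) -> List[Dict]:
    """Issue many small queries and evaluate each page before asking for more."""
    results: List[Dict] = []
    offset = 0
    while len(results) < wanted:
        page = fetch_page(offset, page_size)   # small query
        if not page:
            break                              # index exhausted
        for doc in page:                       # evaluate between queries
            if accept(doc):
                # keep only the fields that are actually needed
                results.append({k: doc[k] for k in fields if k in doc})
                if len(results) >= wanted:
                    break
        offset += page_size
    return results
```

Because each page is evaluated before the next query is sent, the loop can stop as soon as enough results have been collected.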
Michael Peter Christen
Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
12 years ago
fix: display and calculate authors and namespace search navigator if configured (otherwise skip overhead)
(leave the hosts and topics navigators, and the filetype and protocol navigators not included in ConfigPortal, untouched)
12 years ago
Michael Peter Christen
added debug code to crawler monitor
12 years ago
Michael Peter Christen
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
start the local search only if this peer is doing a remote search or
when it is doing a local search and the peer is old
12 years ago
Michael Peter Christen
- removed multi-add of documents (not used)
- inserted specialized code for size request
12 years ago
Michael Peter Christen
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
- added a field cache for solr queries which call only for a single
- fixed a version conflict exception within a solr add request
12 years ago
fixes for filesystem indexing
12 years ago
Michael Peter Christen
added a new fail type attribute for the index to distinguish two
separate fail types: network fail and forced exclusion (i.e. by robots
or forwarding rules).
12 years ago
Michael Peter Christen
- added another enumeration method in kelondro data structure to get a
more random access to data for the balancer
- added random access inside the balancer
12 years ago
Michael Peter Christen
removed overhead by preventing generation of full search results when
only the url is requested
12 years ago
Michael Peter Christen
- using edismax in gsa interface
- generating less field data for gsa search results
- using a boost query in gsa interface to move double content to the end
of the result list
12 years ago
Michael Peter Christen
added a feature to find similarities in documents.
This uses an enhanced version of the Nutch/Solr TextProfileSignature.
As a result, a signature of the document is written to the solr search
index. Additionally, each time a signature is written, it is
checked whether the signature already exists in the index. If the signature
does not exist, the document is marked as unique. The unique attribute
can now be used to sort document lists and bring duplicates to the end
of a result list.
To enable this, a large portion of the search api to Solr had to be
changed. This affected mainly caching of 'exists' searches to enhance
the check for existing signatures and do this without actually doing a
solr query.
Because this is the first time a long number is used as a value in the Solr
store, the value naming in the YaCySchema also had to be adapted and
normalized. As a result, many files had to be changed.
12 years ago
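The signature and uniqueness check described above can be sketched as follows. This is a deliberately simplified Python illustration, not the TextProfileSignature algorithm and not YaCy's code: the tokenization, the MD5 hex digest (the commit stores a long number instead), and all names such as `SignatureIndex` are assumptions made for the example.

```python
import hashlib
import re
from collections import Counter
from typing import Dict, List, Set, Tuple


def text_signature(text: str, min_token_len: int = 2) -> str:
    """Hash a normalized token-frequency profile of the text (simplified)."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    # Sort by descending frequency, then by token, so the profile is deterministic.
    profile = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    joined = " ".join("%s:%d" % (tok, n) for tok, n in profile)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


class SignatureIndex:
    """Stands in for the search index; remembers which signatures were written."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def add(self, text: str) -> Tuple[str, bool]:
        """Return (signature, unique); unique is True if the signature is new."""
        sig = text_signature(text)
        unique = sig not in self._seen
        self._seen.add(sig)
        return sig, unique


def duplicates_last(docs: List[Dict]) -> List[Dict]:
    """Stable sort that moves documents with unique == False to the end."""
    return sorted(docs, key=lambda d: not d["unique"])
```

Two texts that differ only in case and punctuation receive the same signature here, so the second one is marked as not unique and sorts to the end of the list.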
Michael Peter Christen
- added field options to all solr queries. This can be used to restrict
the actual data which is fetched from solr.
- used the new field options to reduce generic options like getting the
load date or the count of search results. should increase overall speed
- used the new field options to reduce overhead in the host browser
during acquisition of links.
- used the field options to make checking of links in crawler faster
- if the crawler is paused, the crawl queue is not cleaned
12 years ago
Michael Peter Christen
Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890'
12 years ago
Michael Peter Christen
Merge remote-tracking branch 'regerdev/master'
12 years ago
Michael Peter Christen
FINALLY YaCy can now search for full strings using double- or
singlequoted strings in the search query line!!!
12 years ago
redesign of the QueryParams class: introduced QueryGoal which holds the
query string parser. This shall be used to create a proper full-string
matching which is handled then by QueryGoal.
12 years ago
content control: use up-to-date definitions
13 years ago
Michael Peter Christen
added deletion of hosts during crawl start if deleteold option was given
13 years ago
Michael Peter Christen
because we have the inurl:<term> search modifier, we don't actually
need regular expressions as search attributes. They have now been removed
from the advanced search page while they are still created internally.
The filter is then expressed against solr as a regular expression filter
query. If the expression selects a specific protocol, host or filetype,
this is then translated into a faceted query.
13 years ago
SMW Import: replaced JSON import routines with stable ones
13 years ago
refactor package
13 years ago
remove old SMW importer which was part of the ymarks package
13 years ago
update and generalization of the SMW import and content control routines
13 years ago
Michael Peter Christen
fixed media search
13 years ago
Michael Peter Christen
removed warnings, removed too-fast pausing of crawls
13 years ago
Michael Peter Christen
added matching of path to query pattern
13 years ago
Michael Peter Christen
fixed a problem with non-terminating crawls
13 years ago
Michael Peter Christen
fix to ftp client
13 years ago
Michael Peter Christen
update to search result logging (this was a remaining issue from the
solr 4.0.0 migration)
13 years ago
Michael Peter Christen
fix for filetype navigator
13 years ago
Michael Peter Christen
bugfixes for crawler
13 years ago
Michael Peter Christen
fixed npe for surrogate import
13 years ago
Michael Peter Christen
more logging
13 years ago
Michael Peter Christen
automatically delete entries from the crawl profile list if crawl is
13 years ago
Michael Peter Christen
added information about the reason of pausing of crawls
13 years ago
Michael Peter Christen
added solr faceted search support to YaCy search results
added solr highlighting / YaCy snippets to YaCy search results
- facets are now much more complete
- facets are computed and searched much faster
- snippet computation is done by solr if solr knows the snippet
13 years ago
Michael Peter Christen
added more thread-renaming for search processes
13 years ago
Michael Peter Christen
set the thread name during solr queries to the solr query to get better
debugging options
13 years ago
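The thread-naming idea above can be sketched as follows. This is a Python illustration only (YaCy is Java); the context manager name, the "solr query: " prefix, and the length limit are assumptions made for the example.

```python
import threading
from contextlib import contextmanager


@contextmanager
def thread_named_for_query(query: str, max_len: int = 80):
    """Temporarily rename the current thread to the query it is executing."""
    thread = threading.current_thread()
    old_name = thread.name
    thread.name = "solr query: " + query[:max_len]
    try:
        yield
    finally:
        thread.name = old_name   # restore the name even if the query fails
```

While the block runs, a thread dump shows which query each thread is working on, which makes slow queries easier to spot.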
Michael Peter Christen
added the visualization of error-urls to host browser
- only visible for admins
- a faceted search generates a huge list for all hosts in the host list
- the faceted search algorithms had to be modified for that
- within the browsing of the directory path, the error cause is written
to the url which is presented as error-url
- the errors are also accumulated for directory sums
13 years ago
Michael Peter Christen
fix for some interface problems
13 years ago
Michael Peter Christen
when a new crawl is started, delete all entries about error-urls for
crawl-start domains
13 years ago
Michael Peter Christen
fixed filetype modifier for media types in text search
13 years ago
Michael Peter Christen
automatically pause the crawler if there is a problem with solr
13 years ago
Michael Peter Christen
renovated the way search results are counted. should be correct now...
13 years ago
Michael Peter Christen
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Michael Peter Christen
Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
13 years ago
- added 'deleteold' option to crawler which causes documents selected
by a crawl filter (host or subpath) to be deleted
- site crawl uses this option by default now
- added an option to run deleteDomain() concurrently
13 years ago
Fix Metadata handling
- language default on missing lang property to "uk" (fix set to nothing)
- language set to TLD (added call to existing language calculation from TLD)
- coordinate number exception on possible lat/lon content of "NaN,NaN"
adjust Netbeans IDE classpath (for Solr/Lucene 4.0.0 jars)
13 years ago
Michael Peter Christen
update to HostBrowser:
- time-out after 3 seconds to speed up display (may be incomplete)
- showing also all links from the balancer queue in the host list (after
the '/') and in the result browser view with tag 'loading'
13 years ago
Michael Peter Christen
migration to solr 4.0.0
13 years ago
Michael Peter Christen
code cleanup
13 years ago
Michael Peter Christen
- fixed the delete option in host browser
- added a delete method which can be used to delete a full subpath in
13 years ago
Michael Peter Christen
added the MIME attribute for the R tag in GSA search result writer
13 years ago
Michael Peter Christen
more refactoring - integrated the code of SnippetProcess into
13 years ago
Michael Peter Christen
tried to clean up the search process mess
13 years ago
Michael Peter Christen
fixed a problem with local search from solr results: now all results
from solr are shown (again)
13 years ago
Michael Peter Christen
- added a delete button in host browser to delete a complete subpath
- removed storage of default collection name - default is now "user"
- made stacking of crawl start points concurrent
13 years ago
Michael Peter Christen
added more / all new crawl profile fields into crawl profile editor
13 years ago
Michael Peter Christen
in case that a crawl profile has a collection assigned, use the
collection to show a name in the web interface. This should prevent
overly long names from making the interface unusable.
13 years ago
Michael Peter Christen
enhanced data structures for balancer and latency computation which
should produce a bit better prognosis about forced waiting times.
13 years ago
Michael Peter Christen
removed options for stopwords which are not used
13 years ago
Michael Peter Christen
added the Google Search Appliance (GSA) api interface to the main menu.
13 years ago
Michael Peter Christen
less latency
13 years ago
Michael Peter Christen
better balancing and duetime-computation also for no-delay intranet
13 years ago
Michael Peter Christen
disabled writing new entries to crawl stacks to prevent a domain
with many documents from blocking refreshing of the crawl queue
13 years ago
Michael Peter Christen
- fix for number of words log message
- adding meta:refresh also to crawler stack
13 years ago
Michael Peter Christen
- added concurrency for robots.txt loading
- changed data model for domain counter
13 years ago
Michael Peter Christen
fixed getSize() which can use the cache size while the crawl is running
13 years ago
Michael Peter Christen
enhancement to solr caching: consider that during a get() the document
is not in solr but the cache points out that a commit is needed to get
the document.
13 years ago
Michael Peter Christen
more auto-commit calls when a search interface is opened, but not when a
search is done there to prevent blocking during search-time.
13 years ago
Michael Peter Christen
if a network configuration is chosen which allows neither DHT nor
P2P communication (i.e. is in robinson mode), then some menu entries are
disabled which have no use in this mode.
13 years ago
Michael Peter Christen
replaced the custom robots.txt loader by the standard http loader
13 years ago
Michael Peter Christen
enhanced solr caching:
- increased cache size which is needed for longer solr commit time
- speed hacks on cache write code
13 years ago
Michael Peter Christen
- removed unnecessary synchronized and deadlock in crawler
- removed problem with monitoring object on Balancer.wait
- added missing user agent settings
13 years ago
update to Balancer algorithm:
- create a load list from the current list of known hosts
- do not create this list for each Balancer.pop access
- create the list from those hosts which have a zero-waiting time
- select 1/3 from that list which have the most urls waiting
- get hosts from the waiting list in random order
- fixes for some delta-time computations
- always load all urls from hosts which have never been loaded before
13 years ago
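The selection steps of the Balancer update above can be sketched as follows. This is a Python illustration under stated assumptions, not the YaCy Balancer: the host map layout `(waiting_ms, queued_urls)`, the function name, and the rounding of "1/3" are hypothetical, and the rule for hosts that have never been loaded is left out.

```python
import random
from typing import Dict, List, Tuple


def build_load_list(hosts: Dict[str, Tuple[int, int]],
                    rng: random.Random) -> List[str]:
    """Build a load list once; callers pop from it instead of rebuilding per access.

    hosts maps a host name to (waiting_ms, queued_urls).
    """
    # 1. Keep only hosts that can be loaded right now (zero waiting time).
    ready = [h for h, (wait_ms, _) in hosts.items() if wait_ms <= 0]
    # 2. Select the third of those hosts with the most urls waiting.
    ready.sort(key=lambda h: hosts[h][1], reverse=True)
    keep = max(1, len(ready) // 3) if ready else 0
    selected = ready[:keep]
    # 3. Hand the selected hosts out in random order.
    rng.shuffle(selected)
    return selected
```

With six ready hosts and one host that still has to wait, the list contains the two ready hosts with the longest queues, in random order.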
moved static method from ClusteredScoreMap to MapDataMining because it
was not used in the ClusteredScoreMap class but only in MapDataMining
13 years ago
- optimize code of augmented parsing to enhance document tags
- commented out augmentedparser.analyse (function not implemented yet)
- adjust init of document title list to always use same list type
13 years ago
Michael Peter Christen
force a commit in advance of a search for the administrator to get most
recent results even if commit time is high and an indexing is ongoing.
13 years ago
Michael Peter Christen
added an option to force a commit to solr.
may be used by a search front-end in case that the commitWithinMs time
is too short to get recently indexed documents.
13 years ago
raise commitWithinMs to the default value from SwitchBoard
(results in lower hd-io)
no dots in memory-graph (there are too many of them)
13 years ago
another performance and memory hack to graphics: this makes it possible
to produce a 100-Megapixel png network graphic image on my 6 year old
laptop in standard configuration in 10 seconds.
13 years ago
Michael Peter Christen
- show more lines in online log
- reverse order is default now
13 years ago
Michael Peter Christen
more image processing hacks
13 years ago
Michael Peter Christen
because the new PngEncoder had a problem with the PixelGrabber which is
caused by a JRE bug, the PixelGrabber had to be circumvented using an
own frame buffer which can be read without a PixelGrabber. This resulted
in ultra-fast and much less memory-consuming transformation. YaCy images
are now generated really fast!
13 years ago
Michael Peter Christen
- added a method for the RasterPlotter to draw arrow endings to lines
- replaced the dot in the NetworkGraph with arrows
- enhanced the image drawing speed using pre-computed color values
- added more attention for OOM cases during very large image painting
13 years ago
Michael Peter Christen
when a new crawl is started, an equal crawl, if still running, is
terminated and the corresponding crawl profile is deleted (this also
clears the crawl queue entries for that crawl profile)
13 years ago
Michael Peter Christen
the web structure image shows the pivot dot in a different color
13 years ago
Michael Peter Christen
- prepared PngEncoder for concurrency: PixelGrabber.grabPixels is the
main time-consuming process. This shall be done in concurrency.
- added concurrent processes to call the PixelGrabber and framework to
do that (queues)
It is now possible to create 4k-Images (3840x2160) i.e. with the Network
Graphics servlet
13 years ago