orbiter
0013d0d0bb
removed superfluous class
12 years ago
orbiter
f90d5296cb
Added new data structure to be used by the balancer (not used yet).
These data structures will enable the balancer to store the crawl queue
into individual queues, one each for a single host.
12 years ago
orbiter
0e8d752462
refactoring
12 years ago
orbiter
8ac2e8c8c9
added a location navigator, which makes the image for the map search
visible whenever a location is available in the search result.
To activate this, the search.navigation property in yacy.conf must be
modified to the new default values.
12 years ago
orbiter
d86d2be5c3
automatically removed Places autotagging if no location library is
wanted
12 years ago
orbiter
214a087cdf
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
96ed0c980e
- added hosthash to all documents (also fail documents which is needed
there for deletion), this fixes a problem for the deletion of old
documents for new crawl starts
- added clickdepth and citation computation for fail documents
12 years ago
Michael Peter Christen
179ad281f9
close include byte buffer after usage
12 years ago
reger
6b9a624808
remove double declaration of TLD_any_zone_filter
12 years ago
orbiter
d2effd21db
fix for npe during location search
12 years ago
orbiter
828603e4f1
fix for 100%CPU problem in error cache cleaning process
12 years ago
orbiter
c64b51134e
hack to add all tokens from the url to text_t. This was working for the
RWI index (and still is working) but not for solr-only search indexes.
Maybe we should find a solution using a separate search field instead.
12 years ago
orbiter
6e8377b8ad
do not check all words with synonym library if the library is empty
12 years ago
orbiter
70ba74b23a
disabled ipv4 preference to enable ipv6-only networks like freifunk
12 years ago
orbiter
f3be1930cb
CPU problem when pushing to the error cache; wrong class,
ConcurrentHashMap needed for concurrency
12 years ago
Michael Peter Christen
e40671ddb7
better and consistent deletions for error urls
12 years ago
Michael Peter Christen
2602be8d1e
- removed ZURL data structure; removed also the ZURL data file
- replaced load failure logging by information which is stored in Solr
- fixed a bug with crawling of feeds: added must-match pattern
application to feed urls to filter out such urls which shall not be in a
wanted domain
- delegatedURLs, which also used ZURLs, are now temporary objects in
memory
12 years ago
Michael Peter Christen
31920385f7
set anchor rel attribute of all links to "nofollow" if the html meta
contains a robots:nofollow or if the http header contains a
"X-Robots-Tag: nofollow"
12 years ago
Michael Peter Christen
57e00baf26
fix for parsing of image links inside of anchor links (image-links)
12 years ago
Michael Peter Christen
61c5e40687
- replaced the properties object in AnchorURL with distinct variables
for anchor attributes.
- as a result, large portions of the parser code had to be adapted
as well
- added a counter target_order_i for anchor links in webgraph
computation
12 years ago
Michael Peter Christen
3ea9bb4427
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
5e31bad711
- the webgraph shall store all links which appear on a web page, not
just the unique links! This made it necessary to adapt a large portion
of the parser and link processing classes to carry a different type of
link collection, whose entries carry the property attributes attached
to web anchors.
- introduction of a new URL class, AnchorURL
- the other url classes, DigestURI and MultiProtocolURI, have been renamed
and refactored to fit into a new document package schema, document.id
- cleanup of net.yacy.cora.document package and refactoring
12 years ago
reger
603368fc3e
remove redundant declaration of USER_AGENT
12 years ago
Michael Peter Christen
1a8c64117f
decreased the responseHeaderDB database which is now flushed more
frequently. This will preserve more documents in the cache in case of a
crash.
12 years ago
Michael Peter Christen
3e22d05290
added option for daterange properties in GSA interface to use a left-
or right-open date range;
i.e. using daterange=..2013-09-09 or daterange=2013-09-02.. in addition
to daterange=2013-09-02..2013-09-09
12 years ago
reger
fe87fb638a
adjust test/ParserTest to dc_description data type
12 years ago
Michael Peter Christen
35ab2cef7b
added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in
html meta fields to get a correct (or: better) date timestamp. The
http:last-modified mostly does not work because it is set to the current
date by most CMS.
12 years ago
Michael Peter Christen
9cc8468b30
added tools to visualize image generation (i.e. during testing)
12 years ago
Michael Peter Christen
dbef8ccfcb
forced deletion of ZURL entries for a specific host for each host that
appears in the crawl url list
12 years ago
Michael Peter Christen
e137ff4171
refactoring (in preparation for new removeHost method)
12 years ago
Michael Peter Christen
7a5574cd51
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
85456f46b2
added two new fields, exact_signature_copycount_i and
fuzzy_signature_copycount_i, which count the number of copies of
non-unique documents and assign this count to each document. Thus, each
document is assigned a number which shows how many copies of this
document exist.
These fields are disabled by default.
12 years ago
orbiter
26366596d9
fix for a problem which occurs when a site is crawled where the start
url is redirected.
12 years ago
Michael Peter Christen
a2511b5600
turned images_alt_txt back to images_alt_sxt because it is not necessary
to index the alt text. Indexed image Text is in images_text_t
12 years ago
Michael Peter Christen
85b1922244
activated image type navigation for image search
12 years ago
Michael Peter Christen
9e12fdff23
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
12 years ago
Michael Peter Christen
ab1201fdfd
fixed wrong facet count
12 years ago
Michael Peter Christen
049c3b3f2e
added an option to exclude image search results from text search. This
is on by default.
12 years ago
Michael Peter Christen
69f85265e1
added an option to put image links to the crawl queue and handle these
like normal documents. Using this option (by default on at this moment;
this might change soon) it is possible to get the exif data into the
search index to be used in image search.
12 years ago
Michael Peter Christen
e8e558a9b7
fix for content domain classification in URIMetadataNode
12 years ago
Michael Peter Christen
a8c5bfcf58
avoid creating unnecessary objects
12 years ago
Michael Peter Christen
5a0de1b77d
moving image description text to image text field
12 years ago
Michael Peter Christen
dc179bd61f
fix for catchall query goal for image search
12 years ago
Michael Peter Christen
5d71a4c8bc
fix for dc:description field
12 years ago
reger
392174de8c
remove all_words, all_strings lists from QueryGoal
- only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only
12 years ago
Michael Peter Christen
169ef8963d
one more fix for image search
12 years ago
Michael Peter Christen
cb85b22725
redesign of the image search process (with much better results;
unfortunately the index schema has changed and p2p image search will not
be much better until many people update)
12 years ago
Michael Peter Christen
6184fd9d9a
fix for solr/gsa result logging
12 years ago
reger
29967102a2
optimized QueryGoal (reducing mem and computation by removing all_hashes)
- all_hashes used for text highlighting and word distance computation which can be done with include_hashes only
12 years ago
orbiter
f106345eef
link strings should not be tokenized
12 years ago