Michael Peter Christen
a8253ca49c
added missing unicode transformation in href link contents during
parsing
11 years ago
Michael Peter Christen
0cf9e9580b
added clickdepth and CR computation debug code to verify that the
process is complete
11 years ago
Michael Peter Christen
234a974955
load images only if their parser flag is activated
11 years ago
Michael Peter Christen
b2c329929f
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
Michael Peter Christen
60187a4ec2
fix in html parser
11 years ago
Michael Peter Christen
e1c1e57877
less overhead calling exist() with only one hash
11 years ago
reger
3d5d366f1c
fix html header in Solr HTMLResponseWriter
- move 1st body content after </head> tag
- add closing <span> tag
11 years ago
Michael Peter Christen
5a02d650ee
avoid cloning
11 years ago
Michael Peter Christen
cc39667399
Speed enhancements and less CPU usage during Solr searches when using
the embedded Solr (the default). This was obtained by circumventing solrj
search encapsulation and the implementation of direct index access
methods to Solr.
The effect will not only be seen during search, but this has also a
strong effect on suggestions (much more) and less CPU power usage during
index distribution (which needs many search requests)
11 years ago
Michael Peter Christen
434e13b46d
in host browser also show the properties of failed documents including
referrer urls (this is a VERY USEFUL SEO and Web Admin feature!!)
11 years ago
reger
69599566f9
catch one more malformed url in proxy url rewrite
11 years ago
reger
605530fec5
catch proxy url rewrite exception
malformed url (" http:\/\/" ) may cause error response
testcase http://localhost:8090/proxy.html?url=http://dictionary.reference.com/browse/test
11 years ago
Michael Peter Christen
9bb7eab389
hacks to prevent storage of data longer than necessary during search and
some speed enhancements. This should reduce the memory usage during
heavy-load search a bit.
11 years ago
orbiter
3c3cb78555
- removed a lot of garbage and bloated code from GuiHandler.
- transformed log lines to String before they are stored because the
storage space is about 1:250 (45kb for one line before transformation,
180 bytes afterwards)
- this saves up to 10MB RAM so we can increase the number of lines to
1000 again.
11 years ago
Michael Peter Christen
5afa6e3aee
Automatically flush the log cache if a short memory status is reached.
For the default of 200 lines this can flush about 10MB.
11 years ago
Michael Peter Christen
030d0776ff
Enhanced crawl start for very, very large crawl lists (i.e. > 5000)
which had a problem because of badly used concurrency.
This fix also caused a redesign of the whole host deletion process.
This should fix bug http://bugs.yacy.net/view.php?id=250
11 years ago
Michael Peter Christen
6aabc4e5c8
reduced logging line memory, 10000 lines had filled up 450MB! grrr.
(thank you, a bomb from the past)
11 years ago
Michael Peter Christen
1a8783147b
enhanced computation of number of solr documents.
11 years ago
Michael Peter Christen
4948c39e48
added concurrency for mass crawl check
11 years ago
Michael Peter Christen
1b4fa2947d
- fixed a problem which occurred when a document was not recognized with
the right content domain (i.e. identifying that it is an image, text
etc.) because it used the file extension and not an existing mime type
assignment.
- fixed the new setting that images shall be loaded for a better image
search.
- both fixes together make it now possible to crawl
commons.wikimedia.org which makes use of 'funny' document names (i.e.
ending with .jpg while the document is html)
11 years ago
Michael Peter Christen
82621bead0
When bootstrapping, always accept one seedlist file without
checking the date of the file. This should help to start the peer in
case that the user has a completely wrong date setting.
11 years ago
Michael Peter Christen
691d7e70fa
added hint to development/commit rss feed
11 years ago
orbiter
20bbde8665
fix for mustmatch regex computation: result had correct semantic, but
may have contained multiple same expressions within the disjunction of
domain-restrictions. This fix removes the redundant restrictions and
makes the regex shorter.
11 years ago
Michael Peter Christen
c833d02cf5
fixed webgraph postprocessing (did nothing and repeated to do this...)
11 years ago
Michael Peter Christen
74d0256e93
enhanced postprocessing: fixed bugs, enable proper postprocessing also
without the harvestingkey, remove crawl profiles after postprocessing,
speed-up for clickdepth computation.
11 years ago
Michael Peter Christen
7b69c438f7
more methods for the table class
11 years ago
Michael Peter Christen
820b896146
Replaced the iframe loading from yacy.net for donations with the
loading of this iframe from the local host. To make this more flexible,
this iframe is loaded once after startup from yacy.net.
11 years ago
reger
0d4efabaa8
fix YaCy version string in proxy headers
(config parameter vString no longer used)
11 years ago
sixcooler
d9a02ed277
NPE fix for my last commit
11 years ago
sixcooler
61f627eb85
fix for ssl-connections from proxy-usage staying in close-wait-state
+ some extra 'close' in HttpClient
11 years ago
Michael Peter Christen
d328cc4a83
fix for didyoumean, added also more asian alphabets
11 years ago
Michael Peter Christen
90c8577840
enhanced ranking; patches to replace old ranking
11 years ago
Michael Peter Christen
1b61bd40ed
- Added new solr field url_file_name_tokens_t which stores the file name
tokens. This can be used to enhance the ranking.
- Added also a rating_i field as basis for later usage.
- enhanced the tokenization process.
11 years ago
orbiter
6efa7532d2
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
11 years ago
orbiter
5f5a97bafc
added the anchor text within web pages to the searchable entities of a
web page. This can be of benefit for the ranking if these fields are
used for boosts.
11 years ago
orbiter
705b3338ee
list more fields available for search and for ranking boosts
11 years ago
sixcooler
d536092fe4
fix false fill NAME_CACHE_MISS-DNS-Cache in case of a timeout
e.g. caused by massive requests when crawling from a file
11 years ago
Michael Peter Christen
78e7aadb26
removed unused initialization method
11 years ago
Michael Peter Christen
4fbc4740df
removed warnings
11 years ago
Michael Peter Christen
21aa6a0321
migration to Solr 4.5.0
11 years ago
Michael Peter Christen
ef31d0f279
fix for rss reader, see http://bugs.yacy.net/view.php?id=294
11 years ago
Michael Peter Christen
101a6e6e14
Patch the citation index for links with canonical tags.
This shall fulfill the following requirement:
If a document A links to B and B contains a 'canonical C', then the
citation rank computation shall consider that A links to C and B does
not link to C.
To do so, we first must collect all canonical links, find all references
to them, get the anchor list of the documents and patch the citation
reference of these links.
11 years ago
reger
fd119deb00
fix NPE on modified since check ( Response.requestHeader allowed to be null)
12 years ago
Michael Peter Christen
b28d43decc
added two more fields source_cr_host_norm_i,target_cr_host_norm_i in
webgraph and an addition to postprocessing to copy all cr ranking
attributes to the link edges associated to the postprocessing documents
12 years ago
Michael Peter Christen
a52f3a597e
fix for canonical-from-http-header feature
12 years ago
Michael Peter Christen
2dd7c5be44
added parsing of http-canonical tags (untested, could not find an
example page)
12 years ago
Michael Peter Christen
4476dea5ba
do not fail if a wrong boost key is used; instead, print only a warning
See also: http://bugs.yacy.net/view.php?id=293
12 years ago
Michael Peter Christen
3bf0104199
fix for crawl domain counter limitation (limit was reached too early)
12 years ago
Michael Peter Christen
82bfd9e00a
- crawl profiles shall be deleted from active and passive stacks if they
are deleted to terminate the crawl because otherwise the crawl will go
on after the load-from-passive stack policy.
- better check if a crawl is terminated using the loader queue.
12 years ago
Michael Peter Christen
1b3d26dd23
hack to remove most of the warning: deprecated messages (but not all,
one is left)
12 years ago