Michael Peter Christen
79464189a4
The 'Locale' vocabulary, which is generated by geo data, has now the
...
objectspace "http://dbpedia.org/resource/ "
13 years ago
Michael Peter Christen
61bb52d55c
- using http://purl.org/dc/terms/references to refer from an
...
auto-annotated document to a 'pseudo-linked' document which has an url
created with an object-prefix as defined in the vocabulary file
13 years ago
Michael Peter Christen
50c576599b
allow multiple parser options instead of printing an error
13 years ago
Michael Peter Christen
8b53771db2
changed behavior of navigation processing:
...
- vocabulary annotation is not done any more into the metadata of urldb
- vocabularies are written into the jena triplestore using a rdf
vocabulary
- vocabularies for rdf tripel must be updated; refactoring done
- with the new navigation tags in the triplestore a faster
pre-urldb-lookup is possible: navigation is processed now within the RWI
during pre-ranking retrieval
- added also a Owl vocabulary stub to add the plain-text url to the
triplestore using the owl:sameas predicate
13 years ago
Michael Peter Christen
5fc6524ca8
- moved triple store to net.yacy.cora.lod (should be generalized there
...
later
- added abstract add, delete, get methods in the triplestore
- added generation of triples after auto-annotation
- migrated all MultiProtocolURI objects to DigestURI in the parser since
the url hash is needed as subject value in the triples in the triple
store
13 years ago
cominch
bbfc53b663
bugfix
13 years ago
cominch
65c5826d93
bugfix
...
Conflicts:
source/net/yacy/document/parser/augment/AugmentParser.java
13 years ago
cominch
5f8ba7f4f2
small changes
...
Conflicts:
source/net/yacy/document/parser/augment/AugmentParser.java
source/net/yacy/interaction/Interaction.java
13 years ago
cominch
90512640bf
Added config switches for custom parser
...
Conflicts:
source/net/yacy/document/TextParser.java
13 years ago
cominch
bcbd8eee33
Add several parsers, for RDFa and rdf files.
...
Conflicts:
source/net/yacy/document/TextParser.java
13 years ago
cominch
9cbfc1a1c0
augmentedProxy, which forwards every proxy request to a
...
rewrite engine to customize existing webpages. originally implemented by
Florian Richter.
Conflicts:
source/de/anomic/http/server/HTTPDProxyHandler.java
13 years ago
Michael Peter Christen
cde20911bb
saved a bit more ram using UTF8 String compression for OpenGeoDB and
...
Geonames data files.
13 years ago
Michael Peter Christen
225ee42879
made the GeoLocation into an interface with the current
...
integer implementation as accuracy implementation of 1.863cm
13 years ago
Michael Peter Christen
96e9d77270
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
...
Conflicts:
source/net/yacy/cora/sorting/WeakPriorityBlockingQueue.java
13 years ago
Michael Peter Christen
96c8119b50
added GeoLocation / GeoPoint classes which uses less memory than
...
Location/Coordinates and has initializers with correct order of lat,lon
coordinates
13 years ago
Michael Peter Christen
461a0ce052
removed warnings
13 years ago
Michael Peter Christen
2fe207f813
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Michael Peter Christen
514700291a
moved Vocabulary to cora package (added in git
...
964406ad17
)
13 years ago
Michael Peter Christen
0284a4d88f
more fixes for double precision of coordinates
13 years ago
Michael Peter Christen
964406ad17
added concurrency enhancement to xml parser
13 years ago
Michael Peter Christen
e0d8643226
- performance hacks
...
- added log warnings in case that search processes run into time-out
situations
- better concurrency for Integer formatter (used a non-synchronized
formatter before)
- bugfix for search termination (a poison pill was missing)
- added timeout parameters for search (again) -> target is, that they
are never reached.
13 years ago
Michael Peter Christen
6e83b02b83
- bugfix for surrogate file reader
...
- bugfix for location search: suppress empty search
13 years ago
Michael Peter Christen
9b4c699526
ehanced location search:
...
- search request are now made using a map boundary
- search results are only computed for the map boundary
- the number of results is adopted to the results in the visible range
- added a double-buffering for the search result markers
- added a search query option for the search results:
/radius/<lat>/<lon>/<radius>
13 years ago
Michael Peter Christen
4d3cc02168
replaced old bzip2 library against better documented commons-compress
...
package from http://commons.apache.org/compress/
13 years ago
Michael Peter Christen
c15fcde1c8
add-on to latest commit
13 years ago
Michael Peter Christen
81737dcb18
removed stack trace from swf parser since we cant do anything there
13 years ago
Michael Peter Christen
acf8d521a2
fix for http://bugs.yacy.net/view.php?id=126
13 years ago
Michael Peter Christen
89142d1e8d
removed (not all) warnings
13 years ago
Roland 'Quix0r' Haeder
a093ccf5eb
Now used synchronization in all close() methods to make sure all objects
...
are 'closed' in an ordered way
Conflicts:
source/de/anomic/http/server/ChunkedInputStream.java
source/de/anomic/http/server/ChunkedOutputStream.java
source/de/anomic/http/server/ContentLengthInputStream.java
source/net/yacy/cora/protocol/Domains.java
source/net/yacy/cora/services/federated/solr/SolrShardingConnector.java
source/net/yacy/cora/services/federated/solr/SolrSingleConnector.java
source/net/yacy/document/content/dao/PhpBB3Dao.java
source/net/yacy/document/parser/html/AbstractTransformer.java
source/net/yacy/kelondro/blob/BEncodedHeap.java
source/net/yacy/kelondro/blob/HeapReader.java
source/net/yacy/kelondro/index/RAMIndexCluster.java
source/net/yacy/kelondro/io/ByteCountInputStream.java
source/net/yacy/kelondro/logging/ConsoleOutErrHandler.java
source/net/yacy/kelondro/table/SQLTable.java
13 years ago
Michael Peter Christen
ba6aaabc51
refactoring + parser bugfixes
13 years ago
Michael Peter Christen
09484955dc
added new entry class for embed tags
13 years ago
Michael Peter Christen
453010bd68
- solved problems with backpath normalization
...
- redesigned in/outbound link handover
- removed iframe links from inbound/outbound in solr scheme
13 years ago
Michael Peter Christen
659178942f
- Redesigned crawler and parser to accept embedded links from the NOLOAD
...
queue and not from virtual documents generated by the parser.
- The parser now generates nice description texts for NOLOAD entries
which shall make it possible to find media content using the search
index and not using the media prefetch algorithm during search (which
was costly)
- Removed the media-search prefetch process from image search
13 years ago
Michael Peter Christen
f8cd57c92f
new indexing strategy: ALL links that appear anywhere are indexed, not
...
only links where the content can be parsed. All non-parseable links are
placed into the noload queue. The search process must therefore be able
to filter out non-text search results.
- This fixes the problem that image search results appeared in the text
search.
- The interactive search can retrieve now ALL types of links
- The p2p interface is now extended to retrieve only certain types of
links (text, image, video, apps)
- The search process has an extension to filter the right document type
according to the search query
13 years ago
Michael Peter Christen
a1a5b015d8
refactoring: moved document Classification to cora package
13 years ago
Michael Peter Christen
4d5da75814
fix for parser problem if a <a>-tag is 'within' html tags with unclosed
...
tags. That prevented the <a> tags from beeing recognized. This is a fix
for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516
13 years ago
Michael Peter Christen
046f3a7e8d
check if httpc has decompressed the release file and rename the file
...
from .tar.gz to .tar if that happened
13 years ago
Michael Peter Christen
e101c2e0e2
added changes from copperdust (submitted by email):
...
1. Improved and fixed language detection:
1.1 Identificator.java - recognition fix (improved)
1.2 DCEntry.java - fix (changed detection order due to detection from
tld in many cases is incorrect)
1.3 MultiProtocolURI.java - fixed and enhanced language from tld
detection (all currently used top-level domains; ccTLD added but not
tested).
2. Ukrainian language update.
3. Main Slavic languages langstats (tested and works fine).
13 years ago
Michael Peter Christen
8d63a5887c
bugfixes
13 years ago
Michael Peter Christen
9ad1d8dde2
complete redesign of crawl queue monitoring: do not look at a
...
ready-prepared crawl list but at the stacks of the domains that are
stored for balanced crawling. This affects also the balancer since that
does not need to prepare the pre-selected crawl list for monitoring. As
a effect:
- it is no more possible to see the correct order of next to-be-crawled
links, since that depends on the actual state of the balancer stack the
next time another url is requested for loading
- the balancer works better since the next url can be selected according
to the current situation and not according to a pre-selected order.
13 years ago
Michael Peter Christen
7e4e3fe5b6
free some memory after parsing html
13 years ago
Michael Peter Christen
4540174fe0
memory hacks
13 years ago
Michael Peter Christen
2e5cd6a1b2
fixed parser extension deny list generation and usage
13 years ago
Michael Peter Christen
8bee1472c9
there is no noindex, only nofollow in links
13 years ago
Michael Peter Christen
c560a582ac
fix for single-word vocabulary lines
13 years ago
Michael Peter Christen
ef78f22ee1
performance hack
13 years ago
Michael Peter Christen
1f4f60654a
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
...
Conflicts:
source/net/yacy/document/parser/pdfParser.java
13 years ago
reger
32104360ce
PDFParser - return at least first 3 pages of PDF
...
fix for pdf parsing without returning parsed text due to interruption by
time out.
13 years ago
Michael Peter Christen
eadb58dd87
small enhancements in pdf parser
13 years ago
reger
b616de5973
PDFParser - return at least first 3 pages of PDF
...
fix for pdf parsing without returning parsed text due to interruption by time out.
13 years ago
Michael Peter Christen
7f9b6b7a0c
added switches to ConfigParser to accept/deny documents by their
...
extension
13 years ago
Michael Peter Christen
4901cee3cc
suppress auto-tagged subject entries when sending out or receiving
...
metadata from other peers
13 years ago
Michael Peter Christen
83009d86f7
added the vocabulary navigator. It can be very simply tested by
...
switching on the locale dictionaries.
13 years ago
Michael Peter Christen
a58dc4a91f
added autotagging to document condenser:
...
- tags that are automatically generated now enrich the dc:subject
- auto-generated tags have a '$' at the beginning of the tag
- auto-generated tags lead the tag name with a vocabulary name
each tag has the form
$<vocabulary-name>:<tag-printname-space-replaced-by-'_'>
13 years ago
Michael Peter Christen
254adea51c
small fixes
13 years ago
Michael Peter Christen
b7bb84c0bb
set a limit to CharBuffer object size to fight against bad/too large
...
content
13 years ago
Michael Christen
e6d51363ee
Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
13 years ago
Marek Otahal
72adbeae90
!Important: move from Hashtable to HashMap
...
Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better
functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue,
but I found notices that some (ugly big) helper classes had to be created in past
to compensate missing Hashtable's functionality. I'd like input if we can remove some of them.
look for //FIX: if these commits
Signed-off-by: Marek Otahal <markotahal@gmail.com>
13 years ago
Michael Christen
fa8da7f89d
vocabularies are now also used as source for a did-you-mean computation
13 years ago
Michael Christen
eaec14ecc4
Dictionaries from words caches can now be used as autotagging vocabulary
13 years ago
Michael Peter Christen
91940fdf56
redesign of WordCache to be prepared to hold multiple
...
independent dictionaries. Such dictionaries can then be also used as
simplified vocabularies.
13 years ago
Michael Christen
bd40a10230
added autotaggig stub .. only reading and parsing of vocabularies at
...
this time
13 years ago
Michael Christen
c04bfaa51b
refactoring
13 years ago
Michael Christen
1f4afb4dc0
performance hacks
13 years ago
Michael Christen
762e0ecfb6
fixed localization dictionaries, see
...
http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3418&view=next
13 years ago
Michael Christen
9cd469e6d6
added pull request from als plus an NPE fix
13 years ago
Al Sutton
39898cb94a
Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton
4c67a964a1
Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer
13 years ago
Al Sutton
3f9b9f953f
Added close() to ensure buffer close actions are invoked
13 years ago
Al Sutton
d73c84f9a0
Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown
13 years ago
Al Sutton
f02ea27b31
Added missing closure of ByteArrayInputSteam
13 years ago
Al Sutton
8993cac4d8
Initial performance improvements
13 years ago
orbiter
ebd840ebf6
- enhanced description on search front page
...
- fixed language and heuristic modifier
- added hint to crawl start that we can do also ftp and smb crawls
- added a protocol extension to remote crawls to transport all search modifiers to remote peers
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
e22f8497c9
- tested the ARC methods
...
- removed strict authentication (if password is empty; this was buggy and not useful; can be switched on if necessary globally and not for each interface method)
- increased speed of CrawlResults page (no dns lookup any more)
- increased speed of favicon display (removed dns lookup)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8104 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
5a55397f99
some last-minute performance hacks
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
apfelmaennchen
564374d1fe
- included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand.
...
- reworked bookmark creation on crawlstart
- many smaller adjustments to ymarks
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
804e48888b
smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter
...
should also be a little bit faster
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8057 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
85d6bf4ac4
fixed urls to media content during indexing
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
0d858d48ec
replaced String with StringBuilder in suggestion process
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8020 6c8d7289-2bf4-0310-a012-ef5d649a1542
13 years ago
orbiter
d2ea250d99
refactoring:
...
- moved many classes from de.anomic to net.yacy
- made more sub-packages for search classes
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
low012
277b454a62
*) added comments
...
*) minor refactoring
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7971 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
6b22865dbc
- removed some warinings
...
- removed a dead update location
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
8a428d3e77
ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7958 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
85a5487d6d
YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7943 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
0819e1d397
protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7942 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
49e5ca579f
added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled.
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
610b01e1c3
- added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index.
...
- some refactoring for mime type discovery
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7919 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
b5252ef91f
added new word recommendation library in DictionaryLoader_p.html
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7913 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
1c007188ad
bugfixes in html parser
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
231074bf0a
fixed a parsing bug by reverting SVN 7766
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7910 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
low012
24e76a7b69
*) Replaced occurrences of "Wikimedia" with "MediaWiki" where applicable. (Thanks to the folks of 0x20.be for pointing this out.)
...
*) Added description of where to place MediaWiki dump for import.
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7905 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
5dd2efc9a2
- bugfixes in html parser
...
- new fields in solr
- extended file viewer to debug parser
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
51cf697acd
refactoring: moved all score-related classes to new ranking package
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
sixcooler
eb14111200
encapsulate potential expensive objects in TextSnippet to allow GC them asap
...
this reduces chance of OOMs at massive search & snippet-fetching
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7865 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
sixcooler
a311596881
finishing up my commits (7855-7858) which could be helpful for
...
not declaring inside loops (helps GC of some VMs)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7859 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
sixcooler
9170a434ed
throwing an exception again in FileUtils.copy(reader, writer)
...
OOMs could occour here and should not be ignored
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7858 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
sixcooler
ce248cc8dd
less byte-arrays of response-content, less byte-array <-> stream conversation
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7856 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
sixcooler
59b767eebd
stop loading via http at defined maximum of bytes - even size is unknown before loading
...
using max-file-size of type int for parsing documents
(since content is used as byte-arrays, 'integer' should be maximum)
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7855 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
299af4943c
added another memory protection hack
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7849 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago
orbiter
b06faab9d3
do not allocate a StringBuilder object in case that there is not enough memory for that
...
git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7846 6c8d7289-2bf4-0310-a012-ef5d649a1542
14 years ago