yacy_search_server

Commit Graph

Author	SHA1	Message	Date
Michael Peter Christen	659178942f	- Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search	13 years ago
Michael Peter Christen	f8cd57c92f	new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query	13 years ago
Michael Peter Christen	a1a5b015d8	refactoring: moved document Classification to cora package	13 years ago
Michael Peter Christen	4d5da75814	fix for parser problem if a <a>-tag is 'within' html tags with unclosed tags. That prevented the <a> tags from beeing recognized. This is a fix for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516	13 years ago
Michael Peter Christen	046f3a7e8d	check if httpc has decompressed the release file and rename the file from .tar.gz to .tar if that happened	13 years ago
Michael Peter Christen	e101c2e0e2	added changes from copperdust (submitted by email): 1. Improved and fixed language detection: 1.1 Identificator.java - recognition fix (improved) 1.2 DCEntry.java - fix (changed detection order due to detection from tld in many cases is incorrect) 1.3 MultiProtocolURI.java - fixed and enhanced language from tld detection (all currently used top-level domains; ccTLD added but not tested). 2. Ukrainian language update. 3. Main Slavic languages langstats (tested and works fine).	13 years ago
Michael Peter Christen	8d63a5887c	bugfixes	13 years ago
Michael Peter Christen	9ad1d8dde2	complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As a effect: - it is no more possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order.	13 years ago
Michael Peter Christen	7e4e3fe5b6	free some memory after parsing html	13 years ago
Michael Peter Christen	4540174fe0	memory hacks	13 years ago
Michael Peter Christen	2e5cd6a1b2	fixed parser extension deny list generation and usage	13 years ago
Michael Peter Christen	8bee1472c9	there is no noindex, only nofollow in links	13 years ago
Michael Peter Christen	c560a582ac	fix for single-word vocabulary lines	13 years ago
Michael Peter Christen	ef78f22ee1	performance hack	13 years ago
Michael Peter Christen	1f4f60654a	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Conflicts: source/net/yacy/document/parser/pdfParser.java	13 years ago
reger	32104360ce	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	13 years ago
Michael Peter Christen	eadb58dd87	small enhancements in pdf parser	13 years ago
reger	b616de5973	PDFParser - return at least first 3 pages of PDF fix for pdf parsing without returning parsed text due to interruption by time out.	13 years ago
Michael Peter Christen	7f9b6b7a0c	added switches to ConfigParser to accept/deny documents by their extension	13 years ago
Michael Peter Christen	4901cee3cc	suppress auto-tagged subject entries when sending out or receiving metadata from other peers	13 years ago
Michael Peter Christen	83009d86f7	added the vocabulary navigator. It can be very simply tested by switching on the locale dictionaries.	13 years ago
Michael Peter Christen	a58dc4a91f	added autotagging to document condenser: - tags that are automatically generated now enrich the dc:subject - auto-generated tags have a '$' at the beginning of the tag - auto-generated tags lead the tag name with a vocabulary name each tag has the form $<vocabulary-name>:<tag-printname-space-replaced-by-'_'>	13 years ago
Michael Peter Christen	254adea51c	small fixes	13 years ago
Michael Peter Christen	b7bb84c0bb	set a limit to CharBuffer object size to fight against bad/too large content	13 years ago
Michael Christen	e6d51363ee	Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git	13 years ago
Marek Otahal	72adbeae90	!Important: move from Hashtable to HashMap Hashtable is an obsolete collection v1, now since v2 offers HashMap with same or better functionality. Please review, almost all code was already moved, so only a few changes. That is not the issue, but I found notices that some (ugly big) helper classes had to be created in past to compensate missing Hashtable's functionality. I'd like input if we can remove some of them. look for //FIX: if these commits Signed-off-by: Marek Otahal <markotahal@gmail.com>	13 years ago
Michael Christen	fa8da7f89d	vocabularies are now also used as source for a did-you-mean computation	13 years ago
Michael Christen	eaec14ecc4	Dictionaries from words caches can now be used as autotagging vocabulary	13 years ago
Michael Peter Christen	91940fdf56	redesign of WordCache to be prepared to hold multiple independent dictionaries. Such dictionaries can then be also used as simplified vocabularies.	13 years ago
Michael Christen	bd40a10230	added autotaggig stub .. only reading and parsing of vocabularies at this time	13 years ago
Michael Christen	c04bfaa51b	refactoring	13 years ago
Michael Christen	1f4afb4dc0	performance hacks	13 years ago
Michael Christen	762e0ecfb6	fixed localization dictionaries, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=3418&view=next	13 years ago
Michael Christen	9cd469e6d6	added pull request from als plus an NPE fix	13 years ago
Al Sutton	39898cb94a	Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer	13 years ago
Al Sutton	4c67a964a1	Added try/finally protection to ensure streams are closed. Added initial size guess for the CharBuffer	13 years ago
Al Sutton	3f9b9f953f	Added close() to ensure buffer close actions are invoked	13 years ago
Al Sutton	d73c84f9a0	Allow initial buffer size definition in TransformWriter, and use available() method to set it in htmlParser. In this situation a ByteArrayInputStream is used so the available() method gives a good size estimation and avoid the buffer needing to be continually grown	13 years ago
Al Sutton	f02ea27b31	Added missing closure of ByteArrayInputSteam	13 years ago
Al Sutton	8993cac4d8	Initial performance improvements	13 years ago
orbiter	ebd840ebf6	- enhanced description on search front page - fixed language and heuristic modifier - added hint to crawl start that we can do also ftp and smb crawls - added a protocol extension to remote crawls to transport all search modifiers to remote peers git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8108 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	e22f8497c9	- tested the ARC methods - removed strict authentication (if password is empty; this was buggy and not useful; can be switched on if necessary globally and not for each interface method) - increased speed of CrawlResults page (no dns lookup any more) - increased speed of favicon display (removed dns lookup) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8104 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	5a55397f99	some last-minute performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8101 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
apfelmaennchen	564374d1fe	- included YMarks in addition to old bookmarks in yacysearchitem.html; don't get confused by the old bookmark dialog, the ymark is automatically added silently beforehand. - reworked bookmark creation on crawlstart - many smaller adjustments to ymarks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8072 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	804e48888b	smaller bug fixes for search behavior; should produce less unnecessary removals and an exact number of results as shown in counter should also be a little bit faster git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8057 6c8d7289-2bf4-0310-a012-ef5d649a1542	13 years ago
orbiter	85d6bf4ac4	fixed urls to media content during indexing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8021 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0d858d48ec	replaced String with StringBuilder in suggestion process git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@8020 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	d2ea250d99	refactoring: - moved many classes from de.anomic to net.yacy - made more sub-packages for search classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7973 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	277b454a62	) added comments ) minor refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7971 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	6b22865dbc	- removed some warinings - removed a dead update location git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7970 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	8a428d3e77	ensure termination of pdf parser to avoid deadlocking of other processes during search result preparation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7958 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	85a5487d6d	YaCy can now use the solr index to compute text snippets. This makes search result preparation MUCH faster because no document fetching and parsing is necessary any more. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7943 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0819e1d397	protection against OOM cases in image parser. See also bugs.yacy.net/view.php?id=54 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7942 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	49e5ca579f	added new configuration property "crawler.embedLinksAsDocuments". If this is switched on (this is default now), the all embedded image, audio and video links from all parsed documents are added to the search index as individual document. This will increase the search index size dramatically but will also enable us to create a much faster image, audio and video search. If the flag is switched on, the index entries are also stored to a solr index, if this is also enabled. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7931 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	610b01e1c3	- added a 'add every media object linked in a html document as a new document' to the html parser. This causes that all image, app, video or audio file that is linked in a html file is added as document. In fact that means that parsing a single html document may cause that a number of documents is inserted into the search index. - some refactoring for mime type discovery git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7919 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b5252ef91f	added new word recommendation library in DictionaryLoader_p.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7913 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	1c007188ad	bugfixes in html parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7912 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	231074bf0a	fixed a parsing bug by reverting SVN 7766 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7910 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	24e76a7b69	) Replaced occurrences of "Wikimedia" with "MediaWiki" where applicable. (Thanks to the folks of 0x20.be for pointing this out.) ) Added description of where to place MediaWiki dump for import. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7905 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	5dd2efc9a2	- bugfixes in html parser - new fields in solr - extended file viewer to debug parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7897 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	51cf697acd	refactoring: moved all score-related classes to new ranking package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7889 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
sixcooler	eb14111200	encapsulate potential expensive objects in TextSnippet to allow GC them asap this reduces chance of OOMs at massive search & snippet-fetching git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7865 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
sixcooler	a311596881	finishing up my commits (7855-7858) which could be helpful for not declaring inside loops (helps GC of some VMs) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7859 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
sixcooler	9170a434ed	throwing an exception again in FileUtils.copy(reader, writer) OOMs could occour here and should not be ignored git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7858 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
sixcooler	ce248cc8dd	less byte-arrays of response-content, less byte-array <-> stream conversation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7856 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
sixcooler	59b767eebd	stop loading via http at defined maximum of bytes - even size is unknown before loading using max-file-size of type int for parsing documents (since content is used as byte-arrays, 'integer' should be maximum) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7855 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	299af4943c	added another memory protection hack git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7849 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b06faab9d3	do not allocate a StringBuilder object in case that there is not enough memory for that git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7846 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	2d4bb139d3	- added counting of links with noindex tag for solr index - bugfixes for solr index git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7820 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	bda3eec0ff	added parsing of canonical link element to html parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7812 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	9706fc55aa	enhanced content scraper (should discover urls much faster in case of very large plain texts) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7787 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f667b9c289	enhanced identificator: using AtomicInteger for counter git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7785 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	115abc8917	- more attributes for search progress bar - moved cache strategy to cora package git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7778 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	77fe69395d	added jempbox-1.5.0.jar which is required by pdfbox-1.5 as stated in http://pdfbox.apache.org/dependencies.html git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7774 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0c1b29f3c9	- applied many small performance hacks - added a memory limitation in the zip parser and the pdf parser - added a search throttling: if there are too many search queries are still to be computed, then new requests are not accepted for some time. if after a one second still no space is there to perform another search, the search terminates with no results. this case should only happen in case of DoS-like situations and in case of strong load on a peer like if it is integrated in metager. - added a search cache deletion process that removes search requests in case that throttling happens git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7766 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4bea3f9714	hack to reduce resource contention caused by massive UTF8 decodings which use java.nio resources: used a ASCII String <-> byte[] conversion wherever possible. Many Strings in YaCy are hashes which are pure ASCII (base64 hashes). The new ASCII String <-> byte[] conversion method have less computation overhead than the UTF8 conversion. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7746 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	e28bd0d038	fix for some possible causes of memory leaks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7741 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	10e2f588f8	- enhanced ybr ranking computation - many speed/performance hacks - added solr charding and new charding web interface - added option to switch off the yacy index when using solr - added new fail-url categories which are used to make a distinction which fail-urls to be sent to solr - refactoring/renaming of some method names to distinguish host/url hashes better - a large number of bug/npe fixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7738 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	3ed4a09368	small features, some bug fixes and performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7733 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	205cc75157	abstraction of surrogate main element (xmlns:geo was missing for wiki extracts) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7727 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	021840e5ba	removed (almost) deadlocks and unnecessary CPU load git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7726 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	9248a4eef4	reduce teh effect of 'Bildersuche findet generierte HTML-Seiten als Bilder' see http://bugs.yacy.net/view.php?id=9 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7705 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	76f2817e00	a fix for the snippet computation and hopefully better snippets git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7701 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	deda54d684	- relaxed matching of string-search (this is now case-insensitive) - added transport of string-search pattern to remote search protocol - fixed a problem parsing snippets with a '-' inside git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7700 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	15e3a57b4e	removed unused functions in condenser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7698 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	e3d19d0a90	fix in Document inboundlinks/outboundlinks sorting git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7690 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4e8fa03514	added more attributes to html evaluation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7688 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	528da7c9ea	removed unused class and added license header for new class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7680 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f6077b3cc0	added more attributes for html parser and enhanced data structures git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7679 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	d8e934c085	better abstraction of http client identification git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7675 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b77b8cac0c	- enhanced html parser: recognized much more details in the content - added more properties to solr index - refactoring - more constants in switchboard - fix for some NPEs - recognition of more images - removed synchronization in HandleMap (obviously not necessary?) - added a nolocal configuration to remove excessive dns lookup (works only on allip - default off). Indexes produced with this setting are all flagged with 'local' and are (on purpose) not usable for freeworld because they will be rejected as beeing local. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7672 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	3d5104d357	- fixed a bug in crawl start with file name (npe in new url) - added deletion of solr index in IndexControlRWIs - added asynchronous adding of large url lists (happens when crawls are startet with file) - fixed npe in Image display - replaced language warning with fine logging - added a domain name cache in Domains that helps to speed up the isLocal property (less DNS lookups) - added a new storage class for this new cache: KeyList. The domain key list is stored in DATA/WORK/globalhosts.list - added concurrent solr updates and chunked transfers (50 documents until a commit is done) for high-speed feeding (> 40000 ppm) - fixed a bug in content scraper that chopped off large parts of crawl lists (using crawl start from file) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7666 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	958ff4778e	enhanced location search: search is now done using verify=false (instead of verify=cacheonly) which will cause that much more targets can be found. This showed a bug where no location information was used from the metadata (and other metadata information) if cache=false is requested. The bug was fixed. Added also location parsing from wikimedia dumps. A wikipedia dump can now also be a source for a location search. Fixed many smaller bugs in connection with location search. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7657 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	c17d102bd8	enhanced speed for OrderedScoreMap inc method and size comparisment in concurrent environments git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7653 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b788182954	some enhancements to scoring speed git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7652 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	01690eab86	fix for mediawiki importer and wikicode parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7651 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4c013d9088	more UTF8 getBytes() performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7649 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	564184909a	enhanced the surrogate parser: better reading of UTF-8 characters git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7634 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	156cf02703	- added an index constraint 'has location' to the condenser - added evaluation of the 'has location' constraint to search using the /location operator git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7633 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0430a94eaa	the location search shows now not re-evaluated locations but only such locations that are attached as metadata to web pages - added parser for in-text appearing geo-locations - added geo-locations to rss search result - added evaluation of metadata-attached geo-locations in yacysearch_location to show search results within a map git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7631 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	9b25d07295	- added geo information parsing to html parser - extended metadata information in index with geolocalisation - added display of location in yacydoc and ViewFile git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7629 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f3baaca920	- enhancements to DNS IP caching and crawler speed - bugfixes (NPEs) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7619 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	78d4c45d09	enhancement during search process: fast fail of search in case that all index feeder have terminated. This change should affect filtering and navigators and should cause that search navigation gets faster git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7614 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	a50f28e6e7	- fixed missing save operation for peer name change - fixed import of mediawiki dump files - added script to add mediawiki dump files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7609 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	1989ebc24b	removed more warnings git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7598 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	8f11d3a5bb	redesigned the ScoreMap classes: - new concurrent score map using atom operation from java concurrency classes - redesigned difference beween StaticScore and Dynamic Score into ScoreMap and ReversibleScoreMap allowed that many classes can now use simple ScoreMap Objects which can be used better in concurrent environments using the ConcurrentScoreMap - switched from DynamicScore to ConcurrentScoreMap usage wherever possible git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7586 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	694fa3a2a5	- replaced more direct string-based UTF-8 conversions by predefined UTF-8 conversion - changed menu structure slightly git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7583 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	30aed9824a	moved getBytes() to UTF8.getBytes() to use a default String encoding git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7580 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
lotus	cb6d307bba	adding extension for parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7579 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	3820525464	more memory protection: auto-flush of caches in case of memory shortage git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7575 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	e1b6916423	always try to guess the size of a StringBuilder to prevent too many memory re-allocations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7572 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	3b40b98256	) set SVN properties ) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7567 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	cb1f49d0f2	replaced all 'new String' with default encoding (missing) or UTF-8 encoding with a String generation method that uses a pre-defined Charset constant for UTF-8. This avoids a cache-lookup for the Charset object using String hashing of the String 'UTF-8'. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7558 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	8d14916c74	more patches for a better out-of-memory management git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7555 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f8d0454c53	small bug fixes and experiments with search speed enhancement git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7549 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	5e186e0122	continuing the fight against deadlocks during time formatting: better caching. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7531 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	a92d80a545	performance enhancements using an alternative to a insensitive collator (a complex string compare): - less synchronizations - better speed ..at most important and commonly used classes: http headers, url parsing and html parsing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7526 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	e717bf74ba	more logging, more care about OOMs git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7503 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	5892fff51f	introduction of dht-burst modes: this can expand the number of target peers in some cases where a better heuristic is needed. The problematic cases are either when a muti-word search is made (still a hard case for our term-oriented DHT) or when a network operator wants that all robinson peers are asked. We therefore introduced two new network steering values that switch on more peers during the peer selection. Because the number of peers can now be very large, the number of maximum httpc connections was also increased. Please see new coments in yacy.network.freeworld.unit for details of the new DHT selection methods. The number of maximum peers is now not fixed to a specific number but may increase with - the partition exponent - the number of redundant peers - the robinson burst percentage - the multiword burst percentage The maximum can then be the number of senior peers (all visible peers). git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7479 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4588b5a291	- fixed document number limitation for crawls that restrict the number of documents per domain - some restructuring of the document counting and logging structures was necessary - better abstraction of CrawlProfiles - added deletion of logs to the index deletion option (if the index is deleted using the servlets) which is necessary to reset the domain counters for the page limitation - more refactoring to get the LibraryProvider more clean - some refactoring of the Condenser class git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7478 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	0cdfb82963	replaced more appearance of double values by float values git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7461 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	eb12e15738	moved all Double values to Float values because of http://www.exploringbinary.com/java-hangs-when-converting-2-2250738585072012e-308/ YaCy does not really need double-precision floating point computation anywhere, so this should not affect any feature git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7460 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	88773e4daa	changed the default port from 8080 to 8090 see also: http://forum.yacy-websuche.de/viewtopic.php?p=21683#p21683 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7454 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	9f38c0023d	*) Minor changes, mainly cleaning up a little bit, no functional changes. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7428 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	54e77e6255	refactoring git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7426 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	10ae8d961b	- cora package has now no dependencies to other yacy packages and becomes a 'base' package (refactoring) - cleaned up (removed special code and documentation for 27c3) - added remote search functions to be used within cora git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7420 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	9eae33f886	*) Ooops... git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7406 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	a001e8075c	*) minor enhancements git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7405 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	11ea966f9e	) added SID file (Commodore 64) sound file parser ) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7403 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	3ca06d6290	patch for http://forum.yacy-websuche.de/viewtopic.php?p=21460#p21460 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7399 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	936e976c23	) added FreeMind (http://freemind.sourceforge.net/) mindmap parser ) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7397 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	3d95981f7d	) cleaning up the code a little bit ) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7396 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	2a6499364d	*) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7395 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	c0274bd123	*) minor changes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7394 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	59b70a5a92	another fix to the ftp crawler: now correct directory listings according to rfc2640 (path with spaces) and better title names for such files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7386 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	9b25a33fd9	- fixed numerous bugs - better document names - fixed problem with ftp crawling - added automatic removal of search results from services that are not online according to the latest network scan: this does not delete the index but just does not show them. after the next network scan when the server is available again, the results are again showed. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7385 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	7bdb13bf7f	more fixes to smb crawling: better file names git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7384 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	c288fcf634	redesigned CrawlStartScanner user interface and added more features: - multiple hosts for environment scans can be given (comma-separated) - each service (ftp, smb, http, https) for the scan can be selected - the scan result can be accumulated or refreshed each time a network scan is made - a scheduler was added to repeat a scan and add all found urls to the indexer automatically git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7378 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
f1ori	9d2159582f	* fix system update if urls are in blacklist (for example for very general blacklists like *.de) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7375 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	56264dcc17	- added CamelCase parser to MultiProtocolURI: generate better to-be-indexed words from urls - integrated new parser into loader processes: enrich document parser - fixed a concurrent modification exception in kelondro iterator - hand-over of document size from crawler to indexer git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7374 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	a563b05b60	enhanced crawler: - added a new queue 'noload' which can be filled with urls where it is already known that the content cannot be loaded. This may be because there is no parser available or the file is too big - the noload queue is emptied with the parser process which indexes the file names only - the 'start from file' functionality now also reads from ftp crawler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7368 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	c36da90261	added a very fast ftp file list generator to site crawler: - when a site-crawl for ftp sites is now started, then a special directory-tree harvester gets the complete directory structure of a ftp server at once - the harvester runs concurrently and feeds into the normal crawl queue also in this: - fixed the 'start from file' crawl function - added a link detector for the html parser. The html parser can now also extract links that are not included in <a> tags. - this causes that a crawl start is now also possible from clear text link files git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7367 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	4e2c14efbb	fixed bugs in parser and ftp client git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7360 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	f0651e5f2f	added image search to yacyinteractive.html this causes that the search result view switches from list format to image preview format when a search is restricted to png, gif or jpg documents git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7358 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
orbiter	b769cce433	- added a catch-all parser for all documents that cannot be parsed: they will contributed with their document url for the search index only - enhanced the pdf and torrent parser: better documents titles - enhanced the ftp client: more time-out time - fixed bugs in json for search results - enhanced yacyinteractive.html: added a file type navigator and a download-script generator for search result files Please have a look at yacyinteractive.html: this will become the hacker-download tool for 27c3! git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7355 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	9b3fae9496	) cleaning up the code a little bit ) program to interface, not implementation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7345 6c8d7289-2bf4-0310-a012-ef5d649a1542	14 years ago
low012	025e3f4790	) renamed classes according to standard Java coding conventions ) removed unsused code git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7328 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
f1ori	a025b1da89	* fix bug when browsing local filesystem (e. g. repository) with yacy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7323 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	4c72885cba	added a sitemap entry parser and loader for sitemaps (a recursion if a sitemap refers to another sitemap) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7295 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	fb92f9ae8e	added mime type image/jpeg (image/jpg is wrong but it is left here because it does not harm and this error also exists in configuration of web servers) see also: http://forum.yacy-websuche.de/viewtopic.php?p=21129#p21129 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7279 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
f1ori	7d8de34778	* add a bit documentation to DigestURI, use DigestURI(string) instead of DigestURI(string, null) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7276 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	58e74282af	added a word counter statistic in condenser which is used by the did-you-mean to calculate best matches for given search words. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7258 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	de722090b5	enhancements in did-you-mean guessing git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7243 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	0d363a94d7	more performance hacks this makes YaCy search results VERY fast for all verify=false search cases and it enhances the search speed also for all other snippet-fetch cases. With this change my peer performed 100 Queries Per Second (!!!) while doing 10 queries simultanously (!!!) in an intranet index of 20000 URLs on my 16-core Mac Check this yourself by doing: cd bin ./searchtestmulti.sh after finishing the run, divide 1000 by the given time per query (which is the qps for one thread) and then multiply again by 10 (because 10 search threads has been started) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7231 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b8aee6d402	performance hacks for better search performance git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7230 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	aacf572a26	- enhancements for search speed - bug fixes in many classes including basic data structure classes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7217 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	d2fd93135c	- moved yacybot user agent string definition to MultiProtocolURI since there are basic access mechanisms where the bot string is needed - migrated the 'yacy' user agent to 'yacybot' in many client methods since the 'yacy' user agent is only used for the proxy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7199 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
f1ori	e670e1ef8e	add charset auto-detection for htmlParser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7186 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
f1ori	ddcd5ae78c	fix for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2989 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7185 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
f1ori	8fe1102452	fix http://forum.yacy-websuche.de/viewtopic.php?p=20889#p18426 reuse code from htmlParser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7184 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	10a9cb1971	simplified snippet computation process and separated the algorithm into two classes also enhances selection criteria for best snippet line computation git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7182 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	84a023cbc8	fixed several search bugs git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7180 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
lotus	d2a3d08c44	avoid div. by zero git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7136 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	570ca577c6	performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7129 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	114bdd8ba7	fixed old sitemap importer which was not able to parse urls containing post elements - removed old parser - removed old importer framework (was only used by removed old parser) - added a new sitemap parser in parser framework - linked new parser with parser access in old sitemap processing routines git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7126 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	c0b08ac59b	slighlty changed way of pdf parser integration git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7124 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5fe828fa06	- replaced pdfbox and fontbox version 1.1.0 with 1.2.1 - added some clear statements that shall clear static cache size within the pdfbox library - the pdfbox library contains a memory leak; it is unsafe to run a peer with pdf parser permanently on. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7120 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	24502fe3de	performance hacks git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7116 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	22047ffad5	enhanced computation speed of many replaceAll string operations git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7107 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	3988a95fb5	added ability in rss reader to parse atom feeds git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7094 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	9d080f387e	change in handling of the all-visible home path for storage in YaCy: the home path can now be distinguished between - data home; the path where the DATA directory is created - application home; everything else This will make it possible to store application data on Mac releases within the ~/Library/YaCy directory; a place where Mac applications write their data. Similar techniques will be possible for debian and windows. To use the new data path, YaCy can be started with -start <data path> or -gui <data path> git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7092 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	0010cd9db1	Support for indexing of RSS feeds! - added a scanning in html parser for rss feeds - storage of rss feed addresses, can be viewed with http://localhost:8080/Tables_p.html?table=rss - rss items retrieved by http://localhost:8080/Load_RSS_p.html (in Index Creation menu) can be selected and indexed - a rss feed retrieved in http://localhost:8080/Load_RSS_p.html can now be fully indexed - indexing of rss feeds can be placed in scheduler git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7073 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	844f158686	- removed dependencies in header framework: moved http date methods from DateFormatter to HeaderFramework changed logging to log4j - added ftp load access to MultiProtocolURI - ensured termination of RSS feed iteration git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7067 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5e7081cd19	refactoring towards a unified loading mechanism for MultiProtocolURIs git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7065 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	e10cd115a9	- added a new RSS reader interface. This is not finished but you can now load and look at RSS feeds. It will be used to index RSS feeds in a way that is appropriate for such kind of data. - refactoring of Mediawiki and PHPBB3 loader interface names (just renamed) - removed two old not used RSS loader interfaces - fixed a bug in RSS parser library of cora - added a new RSS parser component to the set of yacy document parsers git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7053 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	933dc1a600	removed old rss parser (will be replaced with parser from cora package) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7052 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5924a0d851	- enhanced concurrency in database index access for multicore - added statistics about database index caches in PerformanceMemory_p.html - adoped many classes to use the new statistics - added missing close statements git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7018 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	989948e1a9	fixed generic image parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7005 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	27d8a8b53e	removed wrong com.sun.codec class access in generic image parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@7003 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	b6fb239e74	redesign of parser interface: some file types are containers for several files. These containers had been parsed in such a way that the set of resulting parsed content was merged into one single document before parsing. Using this parser infrastructure it is not possible to parse document containers that contain individual files. An example is a rss file where the rss messages can be treated as individual documents with their own url reference. Another example is a surrogate file which was treated with a special operation outside of the parser infrastructure. This commit introduces a redesigned parser interface and a new abstract parser implementation. The new parser interface has now only one entry point and returns always a set of parsed documents. In case of single documents the parser method returns a set of one documents. To be compliant with the new interface, the zip and tar parser had been also completely redesigned. All parsers are now much more simple and cleaner in its structure. The switchboard operations had been extended to operate with sets of parsed files, not single parsed files. additionally, parsing of jar manifest files had been added. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6955 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
low012	d4851441b0	*) Added Android packages to parser in order to be able to create a decentralized search for direct downloads of Android apps. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6953 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	150cf42a1b	migrated all my LGPL 3 -licensed files to the LGPL 2.1 because LGPL 3 is not compatible to the GPL 2 see http://www.gnu.org/licenses/license-list.html for explanation Since (as far as I know) nobody else has ever contributed to these files I may be allowed to just apply an older license. You may consider this as a dual-licensing and may use and optionally replicate the older files under GPL 3. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6952 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	5a4684f21f	allow words with length >= 2 (you can't search for 'wm' with 3-letter words...) lets try that. If we run into a memory problem because of too many 2-letter-words, then we must introduce whitelists for 2-letter words. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6947 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	37b8827a7a	- removed the UPnP library sources from sbbi and added the jar library again. The library was included to get support for fedora releases, but after this time the fact that the sbbi cannot be part of fedora should be re-discussed. If this will still not be possible, then we may integrate the sbbi UPnP package using reflection. - cleaned uo the code. The new eclipse helios provided new warnings for dead code. This change cleans up most of these warnings git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6945 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	777195e8d1	more abstraction for access of LoaderDispatcher and cache git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6937 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	7bcfa033c9	more abstraction of the htcache when using the LoaderDispatcher: a cache access shall not made directly to the cache any more, all loading attempts shall use the LoaderDispatcher. To control the usage of the cache, a enum instance from CrawlProfile.CacheStrategy shall be used. Some direct loading methods without the usage of a cache strategy have been removed. This affects also the verify-option of the yacysearch servlet. If there is a 'verify=false' now after this commit this does not necessarily mean that no snippets are generated. Instead, all snippets that can be retrieved using the cache only are presented. This still means that the search hit was not verified because the snippet was generated using the cache. If a cache-based generation of snippets is not possible, then the verify=false causes that the link is not rejected. git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6936 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	87087f12fe	- scanned remote search process and enhanced some data structure and synchronizations here and there - removed concurrency overhead for small number of index normalizations as it happens during remote search - removed 'load only parseable' constraint for snippet fetch because some resources may not have any url file extension and these had therefore not been parseable and searcheable since they may become parseable after loading when their mime type is known - this partly fixes some problems with http://forum.yacy-websuche.de/viewtopic.php?p=20300#p20300 but more changes are necessary to get all expected search results git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6926 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	de4f30bb2e	UTF-8 fix git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6923 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	3a1cebb598	bugfixes git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6922 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	60e71876ad	- more abstraction (HashMap -> Map) - more concurrency-awareness (HashMap -> ConcurrentHashMap) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6910 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2eea806005	less errors in image parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6907 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	11639aef35	- added new protocol loader for 'file'-type URLs - it is now possible to crawl the local file system with an intranet peer - redesign of URL handling - refactoring: created LGPLed package cora: 'content retrieval api' which may be used externally by other applications without yacy core elements because it has no dependencies to other parts of yacy git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6902 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	98c1d65415	- show up to 10 locations (maps) after search (instead of a max of 5) - order locations by (primary) population and (secondary) longitude (reverse ordering, both) - added population from GeoNames, OpenGeoDB does not have that information - changed default viewpoint of map to (30,15); shows more land and europe in the center git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6893 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	9842fab6e4	- fixes to query parameter - replaced/removed search query attribute (was old style, new is 'query' according to SRU) git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6892 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	1defd580bc	- added option to localization search to distinguish between a search for a location according to the search word only or for the relation between a web search results and locations found in the metadata fields - used that to display two layers on map: cities and search result locations - added many marker grafics for the display of the markers on the map - some refactoring of the yacy news code plus bugfixes for latest move from Tree to Table data structure git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6889 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	bd0a9df895	fix for bad location double check git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6884 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	e43e61e502	added another geolocalization data source: GeoNames - added downloader option in DictionaryLoader - added generalization (interfaces and overarching localization) - more abstraction using the libraries git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6879 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	2126c03a62	- removed download-limit that can be given for the crawler for non-crawler download tasks. This was necessary because the same procedure was used for other downloads like for the download of dictionary files where a limit is not useful. The limit still stays for the indexer - migrated the opengeodb downloader to a new version of the opengeodb-dump git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6873 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	789c6b26ce	added a location search service: using the following servlet/example: http://localhost:8080/yacysearch_location.kml?query=berlin&maximumTime=2000&maximumRecords=100 This will open any application that can consume kml data (which will probably be google earth) on your computer and displays the search result as positions on a map git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6865 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago
orbiter	f23cbd2dab	more bugfixes to date parser git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@6864 6c8d7289-2bf4-0310-a012-ef5d649a1542	15 years ago

... 2 3 4 5 6 ...

422 Commits (049c3b3f2ee2f7ec9467751cf42328f7b0738eb7)