luccioman
f0639d810c
Customized name for Threads still using the default "Thread-n" pattern.
...
This makes thread monitoring easier to read.
8 years ago
luccioman
47af33a04c
Advanced Crawl from local file : better processing of large files.
...
Applied strategy : when there is no restriction on domains or
sub-path(s), stack anchor links once discovered by the content scraper
instead of waiting for the complete parsing of the file.
This makes it possible to handle a crawling start file with thousands of
links in a reasonable amount of time.
Performance limitation : even if the crawl starts faster with a large
file, the content of the parsed file still is fully loaded in memory.
8 years ago
luccioman
7717a3d43d
Fixed license headers on files created to improve favicon management.
8 years ago
luccioman
6e1959f469
Merge branch 'master' of https://github.com/yacy/yacy_search_server.git
...
Conflicts:
htroot/yacysearchitem.java
source/net/yacy/cora/federate/solr/responsewriter/YJsonResponseWriter.java
source/net/yacy/search/schema/CollectionConfiguration.java
source/net/yacy/server/serverObjects.java
8 years ago
reger
b752bcfecb
adjust date in text detection to ignore some program version strings
...
like "3.1.2.0102" see http://mantis.tokeek.de/view.php?id=650
+ expand test case
8 years ago
reger
b017e97421
optimize condenser language detection a little.
...
langdetect probabilities take letter case into account, add words from
description and anchors etc. as is.
+ add it to javadoc
8 years ago
reger
ae3717d087
adjust Tokenizer sentence count to ignore repeated punctuation (like !!!! )
...
+ remove unused sentenceword map (we use only the count)
+ upd test case for sentence count
8 years ago
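The idea behind ignoring repeated punctuation can be sketched as follows. This is a minimal illustration, not the Tokenizer code from the commit: the class name `SentenceCountSketch`, the helper `countSentences` and the choice of `.`, `!` and `?` as terminators are assumptions made for the example.

```java
public class SentenceCountSketch {

    // Hypothetical helper (not YaCy's Tokenizer): counts sentence ends and
    // treats a run of terminators such as "!!!!" or "?!" as a single end.
    static int countSentences(String text) {
        int count = 0;
        boolean inRun = false; // true while the previous char was a terminator
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean terminator = c == '.' || c == '!' || c == '?';
            if (terminator && !inRun) {
                count++; // only the first terminator of a run is counted
            }
            inRun = terminator;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Wow!!!! Really?! Yes.")); // prints 3
    }
}
```

Text that follows the last terminator is not counted in this sketch.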
reger
474f0476c6
adjust Tokenizer sentence count on trailing text after last recognized sentence
...
+ upd test case for rwi multi-word-query (leaving results known to fail untested)
8 years ago
reger
14f7577231
add support for older Word versions (Word6/Word95) to docParser
8 years ago
reger
1a79c64495
generalize DateDetection with holiday date rules readily available in icu
...
to make sure current dates are recognized (was fixed to 2014 - 2016)
+ adjust holiday date parser from pattern.match to pattern.find to deal with leading and trailing text
+ moved relative date recognition (morgen, tomorrow) to parseline (used by query parser only), as not working and problematic for indexing
+ add test case for parseline (used by query parser)
8 years ago
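The difference between matching and finding mentioned above can be illustrated with `java.util.regex`: `Matcher.matches()` succeeds only if the whole input fits the pattern, while `Matcher.find()` also accepts leading and trailing text. The pattern and class name below are hypothetical and are not taken from DateDetection.

```java
import java.util.regex.Pattern;

public class MatchVsFindSketch {

    // Hypothetical date pattern, for illustration only.
    static final Pattern DAY = Pattern.compile("\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    // True only if the entire input is a date.
    static boolean matchesWhole(String s) {
        return DAY.matcher(s).matches();
    }

    // True if a date occurs anywhere in the input.
    static boolean findsInside(String s) {
        return DAY.matcher(s).find();
    }

    public static void main(String[] args) {
        String line = "Meeting on 24.12.2016 at noon";
        System.out.println(matchesWhole(line)); // prints false
        System.out.println(findsInside(line));  // prints true
    }
}
```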
reger
6f68f08354
correct DateDetection Silvester date
...
add Thanksgiving
8 years ago
reger
efcb6a1e74
fix supported mime XML -> xml for rssParser (mime normalized to lower case for comparison)
...
+ add mime text/xml as in use for rss in the wild
8 years ago
luccioman
b1b8e69da8
Fixed NullPointerException cases
8 years ago
reger
a4465c97d6
as requested, disable/remove old swf parser
...
http://forum.yacy-websuche.de/viewtopic.php?f=8&t=5861#p33098
9 years ago
reger
96467c5467
remove unneeded counter in Tokenizer (completing last changes)
...
including a small change to word posintext counting.
We remember/store 1st posintext. Previously following words got a handle (posintext)
excluding found. Now it just counts and assigns true posintext as handle (posintext)
9 years ago
reger
272cdd496a
reactivate sentence counter in WordTokenizer for phrasepos ranking,
...
by counting punctuation (delivered as 1 char word) again.
9 years ago
Michael Peter Christen
5e165a8150
removed unused imports
9 years ago
reger
e310ec5f70
fix posInText ranking calculation to score 0 on no position info
...
+ fix Word posInText calc in Tokenizer to start with 1
+ test case
9 years ago
reger
4c7a77662a
eliminate dependency on file-extension in storeDocument and use supported mime-type instead
...
to also support handling of urls w/o corresponding file-extension.
For this, refactor the use of document.getParserObject() to always return a Parser (for clean logic)
and define/move the scraperObject as local var of AbstractParser.
Adjust related calls to getParserObject (where actually a scraperObject is wanted).
Additionally skip appending url token to parsed text for dht metadata entries
(by default returned as result by rwi index).
9 years ago
reger
ebde21079a
refactor xlsParser to include Excel file attribute (like author) in parser result doc.
...
Similar to ppt and doc parser, completing a TODO in xlsParser.
9 years ago
reger
27163af0e1
improve detection of referenced links by taking http and https link protocol
...
into account
+ correct query start detection of commit f89d4eb51d
9 years ago
luccioman
6e96c7341a
Merge remote-tracking branch 'origin/master'
...
Conflicts:
htroot/Load_MediawikiWiki.java
htroot/Load_PHPBB3.java
htroot/ViewImage.java
9 years ago
reger
9e94989237
upd to PDFBox 2.0.1
9 years ago
reger
24b0fa2a38
extend snapshot Html2Image.pdf2image to use PDFBox image export capability
...
if no external tool installed (and for Win)
Resulting jpg are not always perfect (if graphic included) but imho sufficient.
9 years ago
reger
1d940e5a94
upd commons-compress 1.11
9 years ago
reger
764f5100f0
fix delete of temp file after odt & ooxml parser
...
Close zipfile after parsing
9 years ago
reger
06d0e2aeb9
result heuristic (also used in greedy learning mode) to use outbound links if result is full index doc. Otherwise use default loader method.
...
- The above revealed that the parser start url parameter, declared as AnchorURL, uses only methods of the parent object DigestURL (changed parameter declaration accordingly).
9 years ago
luc
9f712146df
Display icons in ViewFile "links" mode.
9 years ago
reger
6f0b073bf3
override detected language (statistic langdetect) only with TLD determined
...
language if langdetect probability is not high.
+ additionally truncate zh-cn / zh-tw returned by langdetect to 2 char ISO639-1 zh
used by YaCy
9 years ago
reger
b65e2b527d
include use of condenser's content text for language detection.
...
Language identification may show poor performance on documents with short or no
title but clear lang indication in text content. Using content text too
improves lang detection.
+ remove double caching of text in Identificator
9 years ago
luc
3cc5619d93
Improved HTML icons indexing and rendering in search results.
...
See http://mantis.tokeek.de/view.php?id=629
9 years ago
reger
2048b7e057
support scraping start-/enddate from html tag with property "datetime"
...
This may be used in html5 <time> tag (which we don't explicitly support yet for date in content scraping).
9 years ago
reger
900d4584ba
complete resource cleanup of lists in contentscraper's close()
9 years ago
reger
1f18653de0
pass parsed swf content through htmlscraper
...
Swf may contain a subset of html tags which should appear as text.
Especially <font> tag may totally screw up metadata servlet if not filtered out.
9 years ago
reger
18ecf57792
add support of compressed swf to swfParser
...
from JavaSWF2 (source compatible to WebCat).
Moved swf file signature check to parser
Changed use of synced vector to list swf InStream
9 years ago
reger
ff27824964
fix swfParser reading file signature
...
before passing to library (current version expects data w/o signature)
9 years ago
luc
571bc55937
Refactoring : use StandardCharsets constants instead of hard-coded
...
charset names.
9 years ago
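A minimal sketch of this kind of refactoring, assuming UTF-8 as the charset in question (the class and method names are hypothetical): `String.getBytes(String)` with a hard-coded name such as `"UTF-8"` forces callers to handle a checked `UnsupportedEncodingException`, while the `StandardCharsets` constants need no exception handling and cannot be misspelled.

```java
import java.nio.charset.StandardCharsets;

public class CharsetSketch {

    // Before (for comparison): s.getBytes("UTF-8") declares the checked
    // UnsupportedEncodingException and relies on a string literal.

    // After: the constant is checked at compile time.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encode("gr\u00f6\u00dfe");
        System.out.println(decode(bytes).equals("gr\u00f6\u00dfe")); // prints true
    }
}
```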
reger
46ac0867ff
fix poison mediawikiimporter output queue also after ExecutionException
...
in worker thread.
The importer's writer needs a poison marker to close the file. On exception (e.g. OOM)
add a poison marker in the outermost try/catch to ensure the output queue will terminate
in this condition too (and closes+renames the surrogate/in/xxx.prt file)
9 years ago
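The poison-marker pattern described above can be sketched like this. It is not the importer's code: the class `PoisonPillSketch`, the `POISON` marker and the simulated failure are assumptions for illustration, and the sketch places the marker in a `finally` block (a variation on the try/catch mentioned in the commit) so that the writer terminates on both the normal and the failing path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPillSketch {

    // Hypothetical end marker; a distinct object so it can be compared by identity.
    static final String POISON = new String("POISON");

    // Produces records, optionally failing midway, and returns what the writer consumed.
    static List<String> run(boolean failMidway) {
        BlockingQueue<String> out = new LinkedBlockingQueue<>();
        List<String> written = new ArrayList<>();
        Thread writer = new Thread(() -> {
            try {
                String rec;
                // The writer only stops (and could close its file) once it sees the marker.
                while ((rec = out.take()) != POISON) {
                    written.add(rec);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "sketch-writer");
        writer.start();
        try {
            out.add("a");
            if (failMidway) {
                throw new IllegalStateException("simulated worker failure");
            }
            out.add("b");
        } catch (IllegalStateException e) {
            // real code would log the failure here
        } finally {
            out.add(POISON); // always terminate the output queue
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // prints [a]
        System.out.println(run(false)); // prints [a, b]
    }
}
```

Without the marker on the failure path, the writer thread in this sketch would block in `take()` forever.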
reger
a7591d3ed0
fix mediawikiimporter number format exception on coordinate parsing
...
handle incomplete metadata like "NS=43/50//N".
For other {expr ... } type entries a try/catch was added
9 years ago
reger
e84d94f8ca
fix mime table for ms office / open office documents
...
(causing wrong parser detect in intranet mode)
9 years ago
reger
45b9bd8403
adjust MultiProtocolURL.protocol detection to handle mailto with "://" in parameters,
...
and feeding hyperlinks to webgraph processing.
9 years ago
reger
0c5548a7ff
fix (todo) remove redundant holding of email link nameproperty in parser document
9 years ago
reger
6b7c10cef8
fix dc:date in mediawikiimporter/document.writexml to use lastmodified
9 years ago
reger
14803d58cd
let html scraper accept html5 <link rel="icon"> for favicon links
9 years ago
reger
4d2b934487
prevent mailto links getting into parser result document's in/outbound link collection
...
by checking mailto scheme early.
- fix upper case mailto protocol assignment
- add test case for getProtocol
9 years ago
luc
8ebefa4233
Fixed MediaWiki import : DCEntry conversion to SolrInputDocument was
...
failing. Looks like it was broken since Commit
b43811d38c
9 years ago
luc
7736ee5a42
Updated MediawikiImporter main() : display usage in console and stop
...
properly without calling System.exit
9 years ago
luc
27d11f8671
Fixed isSolrDump function : PushBackInputStream was not unread when
...
returning false (for example with a WikiMedia dump).
9 years ago
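The kind of bug fixed here can be shown with a small sketch: a format check that peeks at the start of a `PushbackInputStream` has to push the bytes back on every return path, otherwise the next consumer sees a truncated stream. The class, the helper names and the `<response` prefix are hypothetical and not taken from the isSolrDump implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PeekSketch {

    // Peeks at the first bytes and always pushes them back, match or not.
    // The stream's pushback buffer must hold at least expected.length bytes.
    // A single read() may return fewer bytes on some streams; a sketch only.
    static boolean startsWith(PushbackInputStream in, byte[] expected) throws IOException {
        byte[] head = new byte[expected.length];
        int n = in.read(head, 0, head.length);
        if (n > 0) {
            in.unread(head, 0, n); // restore the stream before returning true or false
        }
        return n == expected.length && Arrays.equals(head, expected);
    }

    // Demo helper: returns "<match>:<everything still readable after the check>".
    static String peekThenReadAll(String content, String prefix) {
        byte[] expected = prefix.getBytes(StandardCharsets.US_ASCII);
        try (PushbackInputStream in = new PushbackInputStream(
                new ByteArrayInputStream(content.getBytes(StandardCharsets.US_ASCII)),
                expected.length)) {
            boolean match = startsWith(in, expected);
            StringBuilder rest = new StringBuilder();
            for (int b = in.read(); b != -1; b = in.read()) {
                rest.append((char) b);
            }
            return match + ":" + rest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peekThenReadAll("<mediawiki>", "<response")); // prints false:<mediawiki>
    }
}
```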
Michael Peter Christen
135a123a77
less logging in new language detection
9 years ago
Michael Peter Christen
d6e9834040
Merge branch 'master' of
...
https://github.com/Scarfmonster/yacy_search_server
# Conflicts:
# .classpath
# build.xml
9 years ago