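A binomial heap supports the heap operations INSERT, MINIMUM, EXTRACT-MIN, and UNION in O(log n) worst-case time.
Merging two binary heaps takes O(n) in the worst case, so a binomial heap beats a binary heap whenever UNION is needed.
A Fibonacci heap improves on the binomial heap, with performance measured in amortized time: INSERT, MINIMUM, UNION, and DECREASE-KEY take O(1); EXTRACT-MIN and DELETE take O(log n).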
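The O(log n) UNION of a binomial heap is built on a constant-time "link" step that joins two binomial trees of the same order. Below is a minimal sketch of that step only, not a full heap; the class name BinomialTreeSketch and its fields are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class: one binomial tree in a min-ordered binomial heap.
class BinomialTreeSketch {
    final int key;
    int order = 0; // a tree of order k holds 2^k nodes
    final List<BinomialTreeSketch> children = new ArrayList<>();

    BinomialTreeSketch(int key) {
        this.key = key;
    }

    // Link two trees of equal order in O(1): the root with the larger key
    // becomes a child of the root with the smaller key.
    static BinomialTreeSketch link(BinomialTreeSketch a, BinomialTreeSketch b) {
        if (a.order != b.order) {
            throw new IllegalArgumentException("trees must have the same order");
        }
        BinomialTreeSketch min = (a.key <= b.key) ? a : b;
        BinomialTreeSketch max = (min == a) ? b : a;
        min.children.add(max);
        min.order++;
        return min;
    }

    // Number of nodes in this tree (used here only to check the 2^k property).
    int size() {
        int n = 1;
        for (BinomialTreeSketch c : children) {
            n += c.size();
        }
        return n;
    }
}
```

A heap of n elements holds at most about log n such trees, so a union that links trees order by order (like adding two binary numbers with carries) stays within O(log n) links.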
Tuesday, December 29, 2015
Maven tips
build a new maven project:
mvn archetype:generate -DgroupId=com.mycompany.helloworld -DartifactId=helloworld -Dpackage=com.mycompany.helloworld -Dversion=1.0-SNAPSHOT
cd helloworld
mvn package
java -cp target/helloworld-1.0-SNAPSHOT.jar com.mycompany.helloworl
help to get more information
mvn help:effective-pom
clean the target folder
mvn clean
mvn compile
mvn test
mvn install
Wednesday, December 2, 2015
Wednesday, November 11, 2015
java tips
1. enumeration:
public enum SimilarityEnum {
Monday, November 9, 2015
1.Word2Vec basic introduction(Chinese):
2.java version:
./word2vec -train text8 -output vectors.bin -cbow 0 -size 48 -window 5 -negative 0 -hs 1 -sample 1e-4 -threads 20 -binary 1 -iter 100
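In the command above, -train text8 sets the input file to text8 and -output vectors.bin sets the output file to vectors.bin. -cbow 0 disables the CBOW model, so the Skip-Gram model is used. -size 48 gives each word a 48-dimensional vector. -window 5 sets the training window to 5, i.e. the five words before and the five words after a word are considered (the actual code also picks the window size at random, never larger than 5). -negative 0 -hs 1 disables negative sampling (NEG) and uses hierarchical softmax (HS). -sample is the subsampling threshold: the more frequently a word occurs in the training data, the more likely it is to be down-sampled. -binary 1 stores the result in binary format; 0 stores it as plain text, in which case you can open the file and read each word with its vector.
Besides the parameters in the command above, a few other word2vec parameters are useful. -alpha sets the learning rate (default 0.025). -min-count sets the minimum frequency (default 5): a word that occurs fewer than 5 times in the corpus is discarded. -classes sets the number of clusters; judging from the source code, the clustering uses k-means. Note that the thread count (-threads 20) also affects the result.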
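· Architecture: skip-gram (slower, better for rare words) vs CBOW (faster)
· Training algorithm: hierarchical softmax (better for rare words) vs negative sampling (better for frequent words and low-dimensional vectors)
· Subsampling of frequent words: can improve both accuracy and speed (useful range 1e-3 to 1e-5)
· Context (window) size: usually around 10 for skip-gram and around 5 for CBOW
4. word2vec input must be words separated by whitespace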
5. word2vec mathematics
sigmoid function
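As a small illustration of the sigmoid (logistic) function used in word2vec's training objective, here is a minimal sketch; the class name SigmoidSketch is a hypothetical name, and the original word2vec C code uses a precomputed lookup table rather than calling exp directly.

```java
// Hypothetical illustration class for the logistic sigmoid.
class SigmoidSketch {
    // sigmoid(x) = 1 / (1 + e^(-x)); maps any real number into (0, 1).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0));  // the midpoint of the curve
        System.out.println(sigmoid(6.0));  // close to 1
        System.out.println(sigmoid(-6.0)); // close to 0
    }
}
```

A useful identity when reading the derivations is sigmoid(-x) = 1 - sigmoid(x).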
-cbow 1 -size 300 -window 5 -negative 3 -hs 0 -sample 1e-5 -threads 12 -binary 1 -iter 15
Lucene tips
1. how to extract query and analyse
eg: "new new new york" --> "new" and "york"
String queryString="New New New York";
Query query = parser.parse(queryString);
Set<Term> queryTerms = new LinkedHashSet<Term>();
searcher.createNormalizedWeight(query, false).extractTerms(queryTerms);
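The LinkedHashSet in the snippet above is what collapses repeated terms while keeping their first-seen order. The following plain-Java sketch shows only that de-duplication step, without Lucene; the whitespace splitting and lowercasing stand in for a real analyzer and are assumptions, as is the class name UniqueTermsSketch.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it does not use or imitate the Lucene API.
class UniqueTermsSketch {
    // Lowercase the query, split on whitespace, and keep each term once,
    // in the order it first appears.
    static Set<String> uniqueTerms(String query) {
        Set<String> terms = new LinkedHashSet<>();
        String normalized = query.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return terms;
        }
        terms.addAll(Arrays.asList(normalized.split("\\s+")));
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(uniqueTerms("New New New York")); // prints [new, york]
    }
}
```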
2. scan input
Scanner input=new Scanner(System.in);
3. MultiFields
Exposes flex API, merged from flex API of sub-segments. This is useful when you're interacting with an IndexReader implementation that consists of sequential sub-readers (e.g. DirectoryReader or MultiReader).
4. get current path
String curDir = System.getProperty("user.dir");
5. set similarity function:
String queryString = "police";
String index = "/Users/chunguo/Downloads/index";
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("TEXT", analyzer);
Query query = parser.parse(queryString);
System.out.println("Searching for: " + query.toString("TEXT"));
TopDocs results = searcher.search(query, 1000);
//Print number of hits
int numTotalHits = results.totalHits;
System.out.println(numTotalHits + " total matching documents");
//Print retrieved results
ScoreDoc[] hits = results.scoreDocs;
for (int i = 0; i < hits.length; i++) {
    Document doc = searcher.doc(hits[i].doc);
    System.out.println("DOCNO: " + doc.get("DOCNO"));
}
You can make Lucene ignore the special characters by sanitizing the query with something like
queryString = QueryParser.escape(queryString);
If you do not want your users to ever use advanced syntax in their queries, you can do this always.
Thursday, November 5, 2015
pack an eclipse project to jar file
Select the project ----> right-click, Export... ---> Java ---> choose JAR file ---> Next --> choose the jar file path and name --> Next --> Next --- choose the Main class ---> Finish.
1. Generate the manifest file:
Select the project ----> right-click, Export... ---> Java ---> choose JAR file ---> Next --> choose the jar file path and name --> Next --> Next. You are now at the window for choosing the Main class, but do not choose it yet. Select Generate the manifest file at the top, tick Save the manifest in the workspace, and under Manifest file: enter the file name, e.g. /testProject/main (testProject is the project name and main is the manifest file name), then click Finish.
The generated jar fails as soon as it is run: couldn't find main class
Manifest-Version: 1.0
Main-Class: com.pacong.convert.auto.propertes.ConvertAutoProperties
Class-Path: jxl.jar
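Here, Manifest-Version: 1.0 is the manifest version number that was just generated automatically;
Main-Class: com.pacong.convert.auto.propertes.ConvertAutoProperties is the class containing the main method;
Class-Path: jxl.jar names the external jar; it tells the exported jar the path and name of the external jars it needs.
3. Select the project ---> right-click, Export ---> Java ---> JAR file ---> choose the jar file path and name ---> Next ---> Next ---> Next. You are again at the window for choosing the main class, but do not choose it here either; select Use existing manifest from workspace and under Manifest file: pick the main file just generated, e.g. /testProject/main, then click Finish. That is all. If the generated jar is named test.jar, the jxl.jar referenced in Class-Path must be in the same directory as test.jar.
Do not select the main.mf and manifest files; choose the existing manifest file in the later step. When writing the manifest: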
Manifest-Version: 1.0
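Main-Class: com.test.GameFrame ; exactly one space is required after "Main-Class:"
Class-Path: nimrodlf-1.2.jar liquidlnf.jar ; exactly one space is required after "Class-Path:"
The file must end with a newline and must not have trailing spaces. Watch for this too; otherwise it still reports that there is no main class.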
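One way to avoid hand-editing mistakes with spaces and the final newline is to let the JDK write the manifest. This is a minimal sketch using java.util.jar.Manifest from the standard library; the class name ManifestSketch is hypothetical and the values passed in main are the ones from the example above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

// Hypothetical illustration class: builds manifest text with the JDK instead of by hand.
class ManifestSketch {
    static String build(String mainClass, String classPath) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version is a required main attribute, so set it explicitly.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        attrs.put(Attributes.Name.CLASS_PATH, classPath);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // write() emits "Name: value" lines and terminates the section with a line break.
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(build("com.test.GameFrame", "nimrodlf-1.2.jar liquidlnf.jar"));
    }
}
```

The text it prints can be saved as the manifest file and then selected with Use existing manifest from workspace.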
jar creation failed
detail:invalid header field
When using Export > Runnable JAR file, the jxl.jar referenced in Class-Path does not need to sit in the same directory as test.jar, which is simpler.
C:\Documents and Settings\Administrator\桌面>java -jar test.jar
C:\Documents and Settings\Administrator\桌面>java -jar test.jar >log.txt
You can also create a start.bat file here; after creating it, right-click --- Edit and enter: java -jar test.jar >log.txt. From then on, double-clicking start.bat runs test.jar.
Third approach: use the Fat Jar plug-in to package a project (J2SE) that references external jars
1 If your program uses third-party APIs or other bundled resources, you must export those third-party files along with it when generating the jar; otherwise the program will not run as expected.
You can generate the jar with the Fat Jar plug-in. Download: http://sourceforge.net/projects/fjep/ The downloaded file is net.sf.fjep.fatjar_0.0.31.zip; after unzipping, the plugins folder contains net.sf.fjep.fatjar_0.0.31.jar (Fat Jar for short).
Plug-in installation: copy the Fat Jar file into the plugins folder of your Eclipse directory and restart Eclipse, then open Window --- Preferences. If the window lists Fat jar preferences, the installation succeeded. If it does not, go to configuration --- org.eclipse.update in your Eclipse directory, delete platform.xml, restart Eclipse, and check Window --- Preferences again; Fat jar preferences should now appear.
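2. Packaging with Fat Jar: right-click your project and choose Build Fat jar from the menu. In the window that opens, click Browse (to pick your main class as Main-Class) --- Next --- Finish (tick the resources you want packaged; all are ticked by default). The jar file is generated in your project and is ready to run.
In my Eclipse, I copied the plug-in into plugins, deleted platform.xml, and started with eclipse.exe -clean, but the plug-in still would not install; I don't know why.
When exporting a jar through Eclipse's Export option, there are a few points to note.
Suppose the exported jar is named demo.jar. Unzip demo.jar; the META-INF directory contains a MANIFEST.MF file, which reads: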
- Manifest-Version: 1.0
- Main-Class: com.zhangqi.you.main.JdbcTest
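1 Main-Class notes:
Main-Class, as the name suggests, is the class with the main method, i.e. the class whose main method runs by default when demo.jar is executed.
When exporting a jar from Eclipse you can set the Main-Class attribute or leave it unset, depending on your needs.
If it is unset, name the class on the command line with -cp (java -jar always uses the manifest's Main-Class and passes anything after the jar name to the program as arguments). For example, java -cp demo.jar com.test.Demo1 runs the Demo1 class
java -cp demo.jar com.test.Demo2 runs the Demo2 class
2 Class-Path notes:
If the exported jar references external jars, running java -jar demo.jar directly throws a ClassNotFoundException; in that case you need to specify a classpath for the exported jar.
Open MANIFEST.MF and add Class-Path: mysql-connector-java-5.0.8-bin.jar below the existing entries; this gives demo.jar a reference to mysql-connector-java-5.0.8-bin.jar.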
- Manifest-Version: 1.0
- Main-Class: com.zhangqi.you.main.JdbcTest
- Class-Path: mysql-connector-java-5.0.8-bin.jar
Also place the mysql-connector-java-5.0.8-bin.jar driver jar in the same directory as demo.jar so that it can be found.
Using git to manage your code
Git tutorial: https://www.liaoxuefeng.com/wiki/896043488029600
1. how to use github.iu?
github.iu is an enterprise instance of GitHub.
You can add both a GitHub account and a github.iu account to GitHub Desktop.
With the desktop version you can use both accounts. In the terminal, though, it is better to pick one account; I chose my github.iu.edu account.
You need to generate SSH keys so that git can connect from your local computer.
2. how to use two accounts simultaneously in GitHub Desktop?
I think it is possible, but one account is probably enough. GitHub Desktop can manage it.
3. how to realize the version control by using the git command?
upload local change to repository:
git reset --mixed origin/master
help to check each commit
shift+q exit log history
how to upload the change to remote github
how to pull new changes from github
An easy way to collaborate two computers in git
If .gitignore was declared too late, see https://www.jianshu.com/p/e5b13480479b for reference; for example, use
git rm -r --cached .
To undo an earlier git add ., check git status first. Then
rm -f ./.git/index.lock
Run java code in server&terminal
java -jar test.jar
If you want to store the printed output in a text file, use the following command:
java -jar test.jar >log.txt
The main class is defined when the code is exported from eclipse.
1. how to run different main classes in the same jar file?
2. how to define the memory usage for executing the jar?
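JVM initial heap size (-Xms).
eg: -Xms6400K, -Xms256M
JVM max heap size (-Xmx).
eg: -Xmx81920K, -Xmx80M
so the command can be: java -Xms<size> -Xmx<size> -jar test.jar >log.txt (JVM options must come before -jar; anything after the jar name is passed to the program as arguments)
If you don't set a main class when exporting, you can choose the main class at run time: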
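To confirm that the heap flags took effect, the program itself can print what the JVM reports. A minimal sketch using the standard java.lang.Runtime API; the class name HeapInfoSketch and the helper toMegabytes are hypothetical names for illustration.

```java
// Hypothetical illustration class: prints the heap limits the running JVM reports.
class HeapInfoSketch {
    // Convert a byte count to whole megabytes (1 MB = 1024 * 1024 bytes).
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit; totalMemory() is the heap currently reserved.
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

Running it once with and once without -Xmx80M shows the difference; the exact numbers vary by JVM and machine.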
java -cp myjar.jar MyClass
how to import extra jar package into jar file
simple ways to export jar with library
export ->runnable jar file-> launch configuration(decide which is main class, which means a jar can only have 1 main class to run ), fill the export destination(eg. a.jar)->finish
nohup java -Xms4g -Xmx4g -jar centrality.jar 8edges.txt output.txt 10 100&
Nohup notes
nohup is a command that keeps your code running even after you exit the terminal. All printed output is saved in a nohup.out file.
Some related terminal commands:
nohup gradle run&
ps pid
pgrep sleep
kill 5234
nohup sleep 200&
fg 1
echo -e "OK! \h" # -e enables escape sequences
Tuesday, September 1, 2015
What is a "document"?
Loosjes (1962, pp. 1-8) explained documentation in historical terms: Systematic access to written texts,
he wrote, became more difficult after the invention of printing resulted in the proliferation of texts;
scholars were increasingly obliged to delegate tasks to specialists; assembling and maintaining collections
was the field of librarianship; bibliography was concerned with the descriptions of documents; the
delegated task of creating access for scholars to the topical contents of documents, especially of parts
within printed documents and without limitation to particular collections, was documentation.
Object --- Document?
Star in sky -- No
Photo of star -- Yes
Stone in river -- No
Stone in museum -- Yes
Animal in wild -- No
Animal in zoo -- Yes
Three meanings of "information" are distinguished: "Information-as-process"; "information-as-knowledge"; and "information-as-thing", the attributive use of "information" to denote things regarded as informative. The nature and characteristics of "information-as-thing" are discussed, using an indirect approach ("What things are informative?"). Varieties of "information-as-thing" include data, text, documents, objects, and events. On this view "information" includes but extends beyond communication. Whatever information storage and retrieval systems store and retrieve is necessarily "information-as-thing".
These three meanings of "information", along with "information processing", offer a basis for classifying disparate information-related activities (e.g. rhetoric, bibliographic retrieval, statistical analysis) and, thereby, suggest a topography for "information science".
          INTANGIBLE                    TANGIBLE
ENTITY    2. Information-as-knowledge   3. Information-as-thing
          Knowledge                     Data, document
PROCESS   1. Information-as-process     4. Information processing
          Becoming informed             Data processing
The literature on information science has concentrated narrowly on data and documents as information resources.
3. Events can, to some extent, be created or re-created. In experimental sciences, it is regarded as being
of great importance that an experiment -- an event -- be designed and described in such a way that it can
be replicated subsequently by others. Since an event cannot be stored and since accounts of the results are
no more than hearsay evidence, the feasibility of re-enacting the experiment so that the validity of the
evidence, of the information, can be verified is highly desirable.
"Information-as-thing", then, is meaningful in two senses: (i) At quite specific situations and points in time an object or event may actually be informative, i.e. constitute evidence that is used in a way that affects someone's beliefs; and (ii) Since the use of evidence is predictable, albeit imperfectly, the term "information" is commonly and reasonably used to denote some population of objects to which some significant probability of being usefully informative in the future has been attributed. It is in this sense that collection development is concerned with collections of information.
Numerous definitions have been proposed for "information". One important use of "information" is to denote knowledge imparted; another is to denote the process of informing. Some leading theorists have dismissed the attributive use of "information" to refer to things that are informative. However, "information-as-thing" deserves careful examination, partly because it is the only form of information with which information systems can deal directly. People are informed not only by intentional communications, but by a wide variety of objects and events. Being "informative" is situational and it would be rash to state of any thing that it might not be informative, hence information, in some conceivable situation. Varieties of "information-as-thing" vary in their physical characteristics and so are not equally suited for storage and retrieval. There is, however, considerable scope for using representations instead.
The Role of Facts in Paul Otlet’s Modernist Project of Documentation
Otlet’s writing is marked by
its extraordinary energy. His muscular and tireless prose powers a huge machine,
a vast assemblage of interconnected parts operating on multiple strata according
to precisely articulated scales.
hyper-modernist energy that provokes the question of this chapter: how are we to understand the co-ordination and organization of the vectors assembled and animated in Otlet’s project of documentation? To use a mathematical trope, what is the singularity that organizes this vector field?
To understand Otlet on facts, it is useful to start with his distinction between natural and social science. The former builds a ‘great monument’ through disciplined, collective work. For natural scientists, ‘speculation and interpretation are secondary’, whereas ‘the social sciences are seen ... not as one discipline ... but as a gathering of personal opinions’ (11).
science is built upon facts: ‘The results of the natural sciences are grounded in millions of carefully observed, analysed, and catalogued facts. These facts have subsequently been integrated into sequences and the combination of these sequences has naturally led to the enunciation of laws, partial at first, general later, from which the most powerful and indestructible synthesis that has ever been made now seems possible’ (11).
The creation of documents in such a way that each item of information has its own identity
A fact can be fully revealed to the consciousness of a reader only when the ‘item of information’ that refers to that fact is constructed and organized in such a way, as Rayward put it, ‘that each item of information has its own identity.’ If the identity of the item of information is somehow compromised, so is the revelation of the fact.
The application of the monographic principle does not complete the documentation
of facts
The results of the natural sciences are grounded in millions of carefully observed,
analysed, and catalogued facts. These facts have subsequently been integrated into
sequences and the combination of these sequences has naturally led to the enunciation
of laws
trimmed stone
Monday, August 31, 2015
09/04/2015 Reading
1. The Big Toe
History of the big toe or foot in different countries, such as Spain and China.
2.What is Documentation?
Sunday, August 30, 2015
Chapter 2 Vocabulary
1. Posting skip pointers. For a postings list of length P, use √P evenly-spaced skip pointers. This heuristic can be improved upon; it ignores any details of the distribution of query terms.
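The √P heuristic can be sketched as follows: given a postings list of length P, place a skip pointer every floor(√P) entries. The class name SkipPointerSketch and the choice to return source indices are assumptions made for illustration, not part of the textbook text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class for the sqrt(P) skip-pointer heuristic.
class SkipPointerSketch {
    // Returns the indices in a postings list of length p that carry a skip pointer.
    // Each pointer at index i jumps to index i + step, where step = floor(sqrt(p)).
    static List<Integer> skipPositions(int p) {
        List<Integer> positions = new ArrayList<>();
        if (p <= 0) {
            return positions;
        }
        int step = (int) Math.floor(Math.sqrt(p));
        // Only place a pointer when its target still lies inside the list.
        for (int i = 0; i + step < p; i += step) {
            positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(skipPositions(16)); // prints [0, 4, 8]
    }
}
```

Longer skips mean fewer pointers to store and compare but fewer chances to skip; the evenly spaced layout is the simple middle ground the note describes.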
2. Most recent search engines support a double quotes syntax (“stanford university”) for phrase queries, which has proven to be very easily understood and successfully used by users.
3. The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index.
4.For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed.
5. Let’s examine the space implications of having a positional index. A posting now needs an entry for each occurrence of a term. The index size thus depends on the average document size. The average web page has less than 1000 terms, but documents like SEC stock filings, books, and even some epic poems easily reach 100,000 terms. Consider a term with frequency 1 in 1000 terms on average. The result is that large documents cause an increase of two orders of magnitude in the space required to store the postings list:
Document size    Expected postings    Expected entries in positional posting
1000             1                    1
100,000          1                    100
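The two rows of the table follow from simple arithmetic: a term that occurs once per 1000 terms yields (document length / 1000) positional entries, while a non-positional index needs one posting per document either way. A minimal sketch, with a hypothetical class and method name:

```java
// Hypothetical illustration class for the positional-index space estimate.
class PositionalIndexSketch {
    // Expected positional entries for a term that occurs once every `oneInEvery` terms
    // in a document of `docTerms` terms.
    static long expectedPositionalEntries(long docTerms, long oneInEvery) {
        return docTerms / oneInEvery;
    }

    public static void main(String[] args) {
        System.out.println(expectedPositionalEntries(1_000L, 1_000L));   // prints 1
        System.out.println(expectedPositionalEntries(100_000L, 1_000L)); // prints 100
    }
}
```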
Wednesday, August 26, 2015
How to give more persuasive presentations: A Q&A with Nancy Duarte
The speaker needs the audience more than the audience needs the speaker.
Going from sounding memorized and canned to sounding natural
is a lot of work.
What is the best way to start creating a presentation?
My best advice is to not start in PowerPoint.
But on the stage, you have to move your body in really big gestures.
Tuesday, August 25, 2015
Sunday, August 23, 2015
Keynote address: Library and information science as a research domain: problems and prospects
Most recently, LIS schools were accused by leaders of
the profession of failing to educate students appropriately for the workplace and
of engaging in esoteric and irrelevant research that was out of touch with real
world needs.
Meanwhile, a
community of information schools known as the "iSchool Caucus" has been
founded that has no affiliation with a professional association in LIS, yet it
contains significant numbers of the leading LIS programs in North America.
LIS: two camps: the library and the information sides.
we might all
agree generally that issues of information retrieval, information quality and
authenticity, policy for access and preservation, the health and security
applications of data mining, raise at least some big questions for information
research to study.
The points made about IR can be made more or less equivalently, I'd argue, for
many other of the current hot topics in information research. We are at the
party, so to speak, but we are rarely the center of attention.
What are the big questions?
It is
important to remember that the value of LIS makes it a potentially strong contributor to the debate and analysis of such issues.
