Tuesday, December 29, 2015

Introduction to Algorithms, Part V: Advanced Data Structures

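A binomial heap supports the mergeable-heap operations INSERT, MINIMUM, EXTRACT-MIN, and UNION in O(log n) worst-case time.
Merging two binary heaps takes O(n) time in the worst case, so a binomial heap beats a binary heap whenever UNION is needed.
A Fibonacci heap improves on the binomial heap, with performance measured in amortized time: INSERT, MINIMUM, UNION, and DECREASE-KEY take O(1); EXTRACT-MIN and DELETE take O(log n).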
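
To make the O(log n) UNION above concrete, here is a minimal sketch of a binomial min-heap in Java. It is not the CLRS pseudocode: as a simplifying assumption it stores the tree of order k in slot k of a list, so UNION becomes binary addition with a carry. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal binomial min-heap sketch: slot k of `trees` holds the order-k tree, or null. */
final class BinomialHeapSketch {
    private static final class Node {
        final int key;
        final List<Node> children = new ArrayList<>(); // child i is a tree of order i
        Node(int key) { this.key = key; }
    }

    private List<Node> trees = new ArrayList<>();

    /** Links two trees of equal order: the larger root becomes a child of the smaller. */
    private static Node link(Node a, Node b) {
        if (b.key < a.key) { Node t = a; a = b; b = t; }
        a.children.add(b);
        return a;
    }

    /** UNION as binary addition over tree orders; O(log n) links. Empties `other`. */
    void union(BinomialHeapSketch other) {
        int slots = Math.max(trees.size(), other.trees.size()) + 1;
        List<Node> merged = new ArrayList<>(slots);
        Node carry = null;
        for (int k = 0; k < slots; k++) {
            List<Node> present = new ArrayList<>(3);
            if (k < trees.size() && trees.get(k) != null) present.add(trees.get(k));
            if (k < other.trees.size() && other.trees.get(k) != null) present.add(other.trees.get(k));
            if (carry != null) present.add(carry);
            // Odd count: one tree stays in slot k. Any remaining pair is linked into the carry.
            merged.add(present.size() % 2 == 1 ? present.remove(present.size() - 1) : null);
            carry = present.size() == 2 ? link(present.get(0), present.get(1)) : null;
        }
        // Drop empty high-order slots so the list length stays O(log n).
        while (!merged.isEmpty() && merged.get(merged.size() - 1) == null) {
            merged.remove(merged.size() - 1);
        }
        trees = merged;
        other.trees = new ArrayList<>();
    }

    /** INSERT is a UNION with a one-element heap. */
    void insert(int key) {
        BinomialHeapSketch single = new BinomialHeapSketch();
        single.trees.add(new Node(key));
        union(single);
    }

    boolean isEmpty() {
        for (Node t : trees) if (t != null) return false;
        return true;
    }

    /** Index of the slot whose root is smallest; scans O(log n) roots. */
    private int minSlot() {
        int best = -1;
        for (int k = 0; k < trees.size(); k++) {
            Node t = trees.get(k);
            if (t != null && (best < 0 || t.key < trees.get(best).key)) best = k;
        }
        if (best < 0) throw new IllegalStateException("heap is empty");
        return best;
    }

    int minimum() { return trees.get(minSlot()).key; }

    /** EXTRACT-MIN: remove the minimum root and union its children back in. */
    int extractMin() {
        Node min = trees.set(minSlot(), null);
        BinomialHeapSketch rest = new BinomialHeapSketch();
        rest.trees = min.children; // children of an order-k root are trees of order 0..k-1
        union(rest);
        return min.key;
    }
}
```

Calling insert repeatedly and then extractMin until isEmpty() returns the keys in ascending order. DECREASE-KEY and DELETE are left out to keep the sketch short.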

Maven tips


build a new maven project:
mvn archetype:generate -DgroupId=com.mycompany.helloworld -DartifactId=helloworld -Dpackage=com.mycompany.helloworld -Dversion=1.0-SNAPSHOT

cd helloworld
mvn package

java -cp target/helloworld-1.0-SNAPSHOT.jar com.mycompany.helloworl

Show the effective POM for more information:
mvn help:effective-pom

Clean the target folder:
mvn clean

Other common lifecycle phases:
mvn compile
mvn test
mvn install

Wednesday, December 2, 2015

Infomap


./Infomap output.net output/ -N 10 --directed --two-level --clu --map

Wednesday, November 11, 2015

java tips

1. enumeration:
public enum SimilarityEnum {
    BM25, VSM, LMD, LMJ;
}

http://blog.csdn.net/wgw335363240/article/details/6397803
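
A short sketch of how such an enum is typically used: look a constant up by name with valueOf, iterate with values(), and branch with switch. The descriptions returned by describe are my own guess at what the abbreviations stand for, not something stated in the note, and SimilarityEnumDemo is a hypothetical class name.

```java
enum SimilarityEnum { BM25, VSM, LMD, LMJ }

final class SimilarityEnumDemo {
    /** Human-readable label per constant (labels are assumptions, for illustration). */
    static String describe(SimilarityEnum s) {
        switch (s) {
            case BM25: return "Okapi BM25";
            case VSM:  return "vector space model";
            case LMD:  return "language model, Dirichlet smoothing";
            case LMJ:  return "language model, Jelinek-Mercer smoothing";
            default:   throw new IllegalArgumentException("unknown: " + s);
        }
    }

    public static void main(String[] args) {
        // valueOf parses the exact constant name and throws on anything else.
        SimilarityEnum chosen = SimilarityEnum.valueOf("BM25");
        System.out.println(chosen + " -> " + describe(chosen));
        // values() returns the constants in declaration order.
        for (SimilarityEnum s : SimilarityEnum.values()) {
            System.out.println(s.ordinal() + " " + s.name());
        }
    }
}
```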

Monday, November 9, 2015

Word2Vec

1. Word2Vec basic introduction (Chinese):
http://blog.csdn.net/zhoubl668/article/details/24314769

2. Java version:
http://blog.csdn.net/zhaoxinfan/article/details/11640573

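3. Train on the word-segmented corpus. Assume the corpus is named test.txt and sits in the word2vec directory (the command below uses text8 as its input file). Run:
./word2vec -train text8 -output vectors.bin -cbow 0 -size 48 -window 5 -negative 0 -hs 1 -sample 1e-4 -threads 20 -binary 1 -iter 100
In the command above:
· -train text8: the input file is text8.
· -output vectors.bin: the output file is vectors.bin.
· -cbow 0: do not use the CBOW model, i.e. train the Skip-Gram model.
· -size 48: each word vector has 48 dimensions.
· -window 5: the training window size is 5, i.e. up to five words before and five words after a word are considered (the actual code also picks the window at random, so the effective size is at most 5).
· -negative 0 -hs 1: do not use negative sampling (NEG); use hierarchical softmax (HS).
· -sample: the sub-sampling threshold. Words that occur more frequently than the threshold in the training data are randomly down-sampled; the more frequent the word, the more likely an occurrence is dropped.
· -binary 1: store the result in binary; 0 stores plain text (in which case you can open the file and see each word with its vector).
Besides these, a few other word2vec parameters are useful: -alpha sets the learning rate (default 0.025). -min-count sets the minimum frequency (default 5): a word that occurs fewer than 5 times in the corpus is discarded. -classes sets the number of clusters; judging from the source code, the clustering is k-means. Note that the thread count (-threads 20) also affects the results.
Note: passing -min-count (default 5) on the command line had no effect for us, possibly because the parameter name contains a hyphen, so we ended up initializing min_count to 1 in word2vec.c.
· Architecture: skip-gram (slower, better for rare words) vs CBOW (faster)
· Training algorithm: hierarchical softmax (better for rare words) vs negative sampling (better for frequent words and low-dimensional vectors)
· Sub-sampling of frequent words: can improve both accuracy and speed (useful range 1e-3 to 1e-5)
· Context (window) size: usually around 10 for skip-gram, around 5 for CBOW

4. The input to word2vec must be words separated by whitespace.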
5. word2vec mathematics
sigmoid function

eg:
-cbow 1 -size 300 -window 5 -negative 3 -hs 0 -sample 1e-5 -threads 12 -binary 1 -iter 15
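
Item 5 above only names the sigmoid function. For reference, it is sigmoid(x) = 1 / (1 + e^(-x)), which squashes any real number (for example a dot product of two word vectors) into the range (0, 1). A minimal Java sketch follows; the class and method names are illustrative and not taken from word2vec.c.

```java
final class SigmoidSketch {
    /** Logistic sigmoid: 1 / (1 + e^(-x)). */
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    public static void main(String[] args) {
        for (double x : new double[] {-6.0, -1.0, 0.0, 1.0, 6.0}) {
            System.out.printf("sigmoid(%.1f) = %.4f%n", x, sigmoid(x));
        }
    }
}
```

Two properties worth remembering: sigmoid(0) = 0.5, and sigmoid(-x) = 1 - sigmoid(x).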



Lucene tips

1. How to extract the distinct analyzed terms of a query
e.g. "new new new york" --> "new" and "york"

String queryString="New New New York";
Query query = parser.parse(queryString);
Set<Term> queryTerms = new LinkedHashSet<Term>();

searcher.createNormalizedWeight(query, false).extractTerms(queryTerms);

2. scan input 
Scanner input=new Scanner(System.in);

3. MultiFields
Exposes flex API, merged from flex API of sub-segments. This is useful when you're interacting with an IndexReader implementation that consists of sequential sub-readers (e.g. DirectoryReader or MultiReader).


4. get current path
curDir = System.getProperty("user.dir");

5. set similarity function:


String queryString = "police";
String index = "/Users/chunguo/Downloads/index";
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
// Set the ranking function here; BM25 is only an example, any Similarity works
searcher.setSimilarity(new BM25Similarity());
// analyzer: the same Analyzer that was used at indexing time
QueryParser parser = new QueryParser("TEXT", analyzer);
Query query = parser.parse(queryString);
System.out.println("Searching for: " + query.toString("TEXT"));
TopDocs results = searcher.search(query, 1000);
// Print number of hits
int numTotalHits = results.totalHits;
System.out.println(numTotalHits + " total matching documents");
// Print retrieved results
ScoreDoc[] hits = results.scoreDocs;
for (int i = 0; i < hits.length; i++) {
    Document doc = searcher.doc(hits[i].doc);
    System.out.println("DOCNO: " + doc.get("DOCNO"));
}





You can make Lucene ignore the special characters by sanitizing the query with something like
queryString = QueryParser.escape(queryString);

If you do not want your users to ever use advanced syntax in their queries, you can do this always.



reader.close(); 
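
Item 1 above reduces "New New New York" to the distinct terms "new" and "york" by collecting them in a LinkedHashSet. The same idea can be tried without a Lucene index. The sketch below is plain Java: lowercasing and whitespace splitting stand in for a real Analyzer (an assumption, not what Lucene does internally), and UniqueTermsSketch is a hypothetical name.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class UniqueTermsSketch {
    /** Returns the distinct lowercased tokens of the query in first-seen order. */
    static Set<String> uniqueTerms(String query) {
        Set<String> terms = new LinkedHashSet<>(); // keeps insertion order, drops repeats
        for (String token : query.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(uniqueTerms("New New New York")); // prints [new, york]
    }
}
```

A LinkedHashSet is used rather than a HashSet so the terms come back in the order they appear in the query.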

Thursday, November 5, 2015

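Packaging an Eclipse project into a jar file

I. When the project references no external jars (J2SE)
Select the project --> right-click, Export... --> Java --> JAR file --> Next --> choose the path and name of the jar file --> Next --> Next --> select the Main class --> Finish.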

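II. When the project references external jars (J2SE)
Approach 1
When the project references external jars, things are a little more involved, because Eclipse's plain JAR export does not bundle external jars. The steps are:
1. Generate the manifest file:
Select the project --> right-click, Export... --> Java --> JAR file --> Next --> choose the path and name of the jar file --> Next --> Next. You are now on the page where the Main class is selected, but do not select it yet. Instead choose "Generate the manifest file" at the top, check "Save the manifest in the workspace", and enter the file name under "Manifest file:", e.g. /testProject/main (testProject is the project name, main is the name of the manifest file). Click Finish.
   Running the jar generated at this point fails with: couldn't find main class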
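2. Back in the project, open the main file just generated and enter the following:
Manifest-Version: 1.0
Main-Class: com.pacong.convert.auto.propertes.ConvertAutoProperties
Class-Path: jxl.jar
Here, Manifest-Version: 1.0 is the manifest version that was generated automatically;
Main-Class: com.pacong.convert.auto.propertes.ConvertAutoProperties is the class that contains the main method;
Class-Path: jxl.jar names the external jar, i.e. it tells the exported jar the path and name of the external jar it needs.
Once this is done, the jar that depends on external jars can be exported.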
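3. Select the project --> right-click, Export --> Java --> JAR file --> choose the path and name of the jar file --> Next --> Next --> Next. You are again on the Main class page; again, do not select a main class. Choose "Use existing manifest from workspace" and, under "Manifest file:", pick the main file just generated, e.g. /testProject/main. Click Finish and you are done. If the generated jar is named test.jar, the jxl.jar listed in Class-Path must sit in the same directory as test.jar.
Do not pick main.mf or the generated manifest option; choose the existing manifest file on the later page. When writing the manifest:
Manifest-Version: 1.0
Main-Class: com.test.GameFrame
Class-Path: nimrodlf-1.2.jar liquidlnf.jar
The colon after Main-Class, Class-Path, etc. must be followed by exactly one space; otherwise a format error is reported.
The file must end with a newline and must not have trailing spaces. Otherwise the main class is still reported as missing, or the export fails with:
jar creation failed
detail: invalid header field
When using Export --> Runnable JAR file, the jxl.jar listed in Class-Path does not need to sit in the same directory as test.jar, which is simpler.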
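4. Notes:
To run the generated jar from the command line:
C:\Documents and Settings\Administrator\桌面>java -jar test.jar
If the jar contains System.out.println statements and you want to see what they print, use:
C:\Documents and Settings\Administrator\桌面>java -jar test.jar >log.txt
The output is written to log.txt, which is created automatically in the same directory as test.jar.
You can also create a start.bat file; right-click it --> Edit and enter: java -jar test.jar >log.txt. From then on, double-clicking start.bat runs test.jar.
... E: drive, then the jars you need to import must also be placed in that directory.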
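Approach 2
Copy the third-party jar directly into jre/lib/ext/ under the JDK installation directory. Eclipse must then be restarted so that the jar is loaded into the automatically generated system library. At that point, the jar packaged by following the steps above ...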

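Approach 3: use the Fat Jar plugin to package a project that references external jars (J2SE)
  1. If your program uses third-party APIs or other bundled resources, you must export those third-party files together with your jar; otherwise the program will not run as expected.
   You can build the jar with the Fat Jar plugin. Download: http://sourceforge.net/projects/fjep/ . The downloaded file is net.sf.fjep.fatjar_0.0.31.zip; after unzipping, the plugins folder contains net.sf.fjep.fatjar_0.0.31.jar (the Fat Jar plugin).
Installing the plugin: copy the Fat Jar file into the plugins directory of your Eclipse installation and restart Eclipse. Then open Window --> Preferences: if the dialog contains a "Fat jar preferences" entry, the installation succeeded. If it does not, go to configuration/org.eclipse.update under your Eclipse directory, delete platform.xml, and restart Eclipse; the "Fat jar preferences" entry should then show up under Window --> Preferences.
2. Packaging with Fat Jar: right-click the project and choose "Build Fat jar" from the menu. In the dialog, click Browse next to Main-Class to select your main class --> Next --> Finish (check the resources to bundle; all are checked by default). The jar file is generated inside the project and runs as expected.
   In my Eclipse, copying the plugin into plugins, deleting platform.xml, and even starting with eclipse.exe -clean did not help; the plugin would not install, and I don't know why.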




A few points to note when exporting a jar through Eclipse's Export wizard.

Suppose the exported jar is named demo.jar. Unpack demo.jar; the META-INF directory contains a MANIFEST.MF file, which reads:
Manifest-Version: 1.0
Main-Class: com.zhangqi.you.main.JdbcTest

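1. Main-Class:
    Main-Class, as the name suggests, is the class whose main method runs by default when demo.jar is executed.
    When exporting a jar from Eclipse you may set the Main-Class attribute or leave it unset, depending on the situation.

    If the jar contains only one main method and every other class exists to serve it, set Main-Class at export time.
    If the jar contains several main methods and the one to run is decided at run time, leave Main-Class unset at export time and name the class on the command line, putting the jar on the class path with -cp (with -jar, the class name would only be passed to the program as an argument).
  For example: java -cp demo.jar com.test.Demo1  runs the Demo1 class
               java -cp demo.jar com.test.Demo2  runs the Demo2 class

2. Class-Path:
     Class-Path, as the name suggests, is the class path of the referenced classes.
     If the exported jar references external jars, running java -jar demo.jar directly throws a ClassNotFoundException; the exported jar needs a Class-Path entry.
     Open MANIFEST.MF and add Class-Path: mysql-connector-java-5.0.8-bin.jar below the existing entries; demo.jar then references mysql-connector-java-5.0.8-bin.jar.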
Manifest-Version: 1.0
Main-Class: com.zhangqi.you.main.JdbcTest
Class-Path: mysql-connector-java-5.0.8-bin.jar


     Put the mysql-connector-java-5.0.8-bin.jar driver in the same directory as demo.jar and it will be found.
     If several jars are referenced, separate them with spaces.
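
The formatting rules above (exactly one space after the colon, jars separated by spaces, a newline at the end) are easy to get wrong by hand. One way to sidestep them is to let the JDK's java.util.jar.Manifest class write the file. The sketch below is only an illustration of that idea, not part of the Eclipse workflow described here; ManifestSketch and render are hypothetical names, and the main class and jar names are the ones from the examples above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

final class ManifestSketch {
    /** Renders a manifest with Manifest-Version, Main-Class and an optional Class-Path. */
    static String render(String mainClass, String... classPathJars) {
        Manifest manifest = new Manifest();
        Attributes attrs = manifest.getMainAttributes();
        // Manifest-Version must be present, or write() emits no main attributes at all.
        attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attrs.put(Attributes.Name.MAIN_CLASS, mainClass);
        if (classPathJars.length > 0) {
            // Class-Path entries are separated by single spaces.
            attrs.put(Attributes.Name.CLASS_PATH, String.join(" ", classPathJars));
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            manifest.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.print(render("com.test.GameFrame", "nimrodlf-1.2.jar", "liquidlnf.jar"));
    }
}
```

The returned text can be saved as the manifest file and selected under "Use existing manifest from workspace".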

how to install software in mac without administrator permission

Using git to manage your code

Git tutorial: https://www.liaoxuefeng.com/wiki/896043488029600

1. How to use github.iu?
github.iu is an enterprise instance of GitHub.
You can add both a github account and a github.iu account to GitHub Desktop.
With the desktop version you can use both accounts. In the terminal, though, it is better to pick one account; I chose my github.iu.edu account.
You need to generate SSH keys so that git on your local computer can connect to the server.
http://1ke.co/course/194
https://git-scm.com/book/zh/v2/Git-%E5%9F%BA%E7%A1%80-%E8%AE%B0%E5%BD%95%E6%AF%8F%E6%AC%A1%E6%9B%B4%E6%96%B0%E5%88%B0%E4%BB%93%E5%BA%93
2. How to use two accounts simultaneously in GitHub Desktop?
I think it is possible, but one account is probably enough. GitHub Desktop can manage this.
3. How to do version control with git commands?

upload local change to repository:
git reset --mixed origin/master
git add . 
git commit -m "This is a new commit for what I originally planned to be amended" 
git push origin master

Another way combines add and commit in one step (it only picks up changes to files that are already tracked):
git commit -a -m 'added new benchmarks'



…or create a new repository on the command line

echo "# aaa" >> README.md
git init
git add .
git add README.md
git commit -m "first commit"
git remote add origin https://github.iu.edu/gao27/aaa.git
git push -u origin master

…or push an existing repository from the command line

git remote add origin https://github.iu.edu/gao27/aaa.git
git push -u origin master

#remove .git

rm -rf .git


view the history of commits:
git log

press q to exit the log view



how to upload the change to remote github
git add .
git commit -m "first commit"
git push -u origin master


how to pull new changes from github
git pull



how to go back to a previous version
// open a new branch
 ZhengGao:src gaozheng$ git reset --hard HEAD
HEAD is now at 3b89d52 added new benchmarks
ZhengGao:src gaozheng$ git reset 3b89d52
ZhengGao:src gaozheng$ git reset --soft HEAD@{1}
ZhengGao:src gaozheng$ 
ZhengGao:src gaozheng$ git commit -m "Revert to 56e05fced"
[master e9681ab] Revert to 56e05fced
 3 files changed, 3 insertions(+)
 create mode 100644 test1.txt
 create mode 100644 test2.txt
 create mode 100644 test3.txt
ZhengGao:src gaozheng$ git reset --hard
HEAD is now at e9681ab Revert to 56e05fced

An easy way to work across two computers with git
1. Initialize one folder as the master copy
git init
git add .
git add README.md
git commit -m "first commit"
git remote add origin https://github.iu.edu/gao27/aaa.git
git push -u origin master
2. Copy that folder to the other computer
3. On either computer, pull first and then push your updates
git pull
git add .
git commit -m "first commit"
git push -u origin master

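How to use .gitignore correctly:
https://www.jianshu.com/p/a3e6b5b2ab59
If .gitignore was declared too late (the files are already tracked), see https://www.jianshu.com/p/e5b13480479b ; for example, use
git rm -r --cached .

To cancel an earlier git add ., check git status first. Then, if a stale lock file was left behind (for example by an interrupted git add), remove it: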

rm -f ./.git/index.lock

Neo4j features

Gradle is a good tool for building and running code, like Maven

Run java code in server&terminal

simple:
java -jar test.jar
To save the printed output to a text file, use the following command:
java -jar test.jar >log.txt

The main class is defined when the code is exported from eclipse.

Questions:
1. How to run different main classes from the same jar file?
2. How to set the memory available to the JVM when running the jar?
-Xms<size>
Initial JVM heap size.
e.g. -Xms6400K, -Xms256M
-Xmx<size>
Maximum JVM heap size.
e.g. -Xmx81920K, -Xmx80M
JVM options must come before -jar, so the command can be: java -Xms<size> -Xmx<size> -jar test.jar >log.txt
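
To check that the -Xms and -Xmx options actually took effect, a small program can print what the JVM reports about its heap. This is a minimal sketch using the standard Runtime API; HeapInfoSketch is a hypothetical name, and the reported maximum is usually close to, but not always exactly, the -Xmx value.

```java
final class HeapInfoSketch {
    /** Converts a byte count to whole mebibytes. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory: upper bound the heap may grow to (governed by -Xmx).
        System.out.println("max heap:     " + toMiB(rt.maxMemory()) + " MiB");
        // totalMemory: heap currently reserved by the JVM (starts near -Xms).
        System.out.println("current heap: " + toMiB(rt.totalMemory()) + " MiB");
        // freeMemory: unused part of the currently reserved heap.
        System.out.println("free:         " + toMiB(rt.freeMemory()) + " MiB");
    }
}
```

For example, compile it and run java -Xms256M -Xmx512M HeapInfoSketch, then compare the printed values with the options.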

If you don't set a main class when exporting, you can choose the main class at run time:

java -cp myjar.jar MyClass

how to import extra jar package into jar file
http://www.cnblogs.com/youxin/archive/2012/06/03/2532914.html

A simple way to export a jar together with its libraries:
Export --> Runnable JAR file --> Launch configuration (this decides the main class, so a runnable jar has exactly one main class) --> fill in the export destination (e.g. a.jar) --> Finish


 nohup java -Xms4g -Xmx4g -jar centrality.jar 8edges.txt output.txt 10 100&

Nohup notes

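nohup is a command that keeps a program running even after you exit the terminal. All printed output is saved to a nohup.out file.

Some related terminal commands:

nohup gradle run &     # start a background job that survives logout
ps <pid>               # show the process with the given pid
exit                   # leave the terminal; the nohup job keeps running
pgrep sleep            # list the pids of processes named sleep
kill 5234              # terminate the process with pid 5234
nohup sleep 200 &      # another background job
jobs                   # list the jobs started from this shell
fg 1                   # bring job 1 to the foreground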

sh:

echo -e "OK! \n" # -e enables interpretation of escape sequences

Tuesday, September 1, 2015

What is a "document"?


Loosjes (1962, pp. 1-8) explained documentation in historical terms: Systematic access to written texts, he wrote, became more difficult after the invention of printing resulted in the proliferation of texts; scholars were increasingly obliged to delegate tasks to specialists; assembling and maintaining collections was the field of librarianship; bibliography was concerned with the descriptions of documents; the delegated task of creating access for scholars to the topical contents of documents, especially of parts within printed documents and without limitation to particular collections, was documentation.

Object             Document?
Star in sky        No
Photo of star      Yes
Stone in river     No
Stone in museum    Yes
Animal in wild     No
Animal in zoo      Yes

INFORMATION AS THING


Three meanings of "information" are distinguished: "Information-as-process"; "information-as-knowledge"; and "information-as-thing", the attributive use of "information" to denote things regarded as informative. The nature and characteristics of "information-as-thing" are discussed, using an indirect approach ("What things are informative?"). Varieties of "information-as-thing" include data, text, documents, objects, and events. On this view "information" includes but extends beyond communication. Whatever information storage and retrieval systems store and retrieve is necessarily "information-as-thing".
These three meanings of "information", along with "information processing", offer a basis for classifying disparate information-related activities (e.g. rhetoric, bibliographic retrieval, statistical analysis) and, thereby, suggest a topography for "information science".

            INTANGIBLE                       TANGIBLE
ENTITY      2. Information-as-knowledge      3. Information-as-thing
               Knowledge                        Data, document
PROCESS     1. Information-as-process        4. Information processing
               Becoming informed                Data processing

The literature on information science has concentrated narrowly on data and documents as information resources.

we find the evidence of events is used in three different ways:
1. Objects, which can be collected or represented, may exist as evidence associated with events: bloodstains on the carpet, perhaps, or a footprint in the sand;
2. There may well be representations of the event itself: photos, newspaper reports, memoirs. Such documents can be stored and retrieved; and, also,
3. Events can, to some extent, be created or re-created. In experimental sciences, it is regarded as being of great importance that an experiment -- an event -- be designed and described in such a way that it can be replicated subsequently by others. Since an event cannot be stored and since accounts of the results are no more than hearsay evidence, the feasibility of re-enacting the experiment so that the validity of the evidence, of the information, can be verified is highly desirable.

"Information-as-thing", then, is meaningful in two senses: i) At quite specific situations and points in time an object or event may actually be informative, i.e. constitute evidence that is used in a way that affects someone's beliefs; and (ii) Since the use of evidence is predictable, albeit imperfectly, the term "information" is commonly and reasonably used to denote some population of objects to which some significant probability of being usefully informative in the future has been attributed. It is in this sense that collection development is concerned with collections of information.

Numerous definitions have been proposed for "information". One important use of "information" is to denote knowledge imparted; another is to denote the process of informing. Some leading theorists have dismissed the attributive use of "information" to refer to things that are informative. However, "information-as-thing" deserves careful examination, partly because it is the only form of information with which information systems can deal directly. People are informed not only by intentional communications, but by a wide variety of objects and events. Being "informative" is situational and it would be rash to state of any thing that it might not be informative, hence information, in some conceivable situation. Varieties of "information-as-thing" vary in their physical characteristics and so are not equally suited for storage and retrieval. There is, however, considerable scope for using representations instead.

The Role of Facts in Paul Otlet’s Modernist Project of Documentation


Otlet’s writing is marked by its extraordinary energy. His muscular and tireless prose powers a huge machine, a vast assemblage of interconnected parts operating on multiple strata according to precisely articulated scales.

hyper-modernist energy that provokes the question of this chapter: how are we to understand the co-ordination and organization of the vectors assembled and animated in Otlet’s project of documentation? To use a mathematical trope, what is the singularity that organizes this vector field?

To understand Otlet on facts, it is useful to start with his distinction between natural and social science. The former builds a ‘great monument’ through disciplined, collective work. For natural scientists, ‘speculation and interpretation are secondary’, whereas ‘the social sciences are seen ... not as one discipline ... but as a gathering of personal opinions’ (11).

science is built upon facts: ‘The results of the natural sciences are grounded in millions of carefully observed, analysed, and catalogued facts. These facts have subsequently been integrated into sequences and the combination of these sequences has naturally led to the enunciation of laws, partial at first, general later, from which the most powerful and indestructible synthesis that has ever been made now seems possible’ (11).

The creation of documents in such a way that each item of information has its own identity

A fact can be fully revealed to the consciousness of a reader only when the ‘item of information’ that refers to that fact is constructed and organized in such a way, as Rayward put it, ‘that each item of information has its own identity.’ If the identity of the item of information is somehow compromised, so is the revelation of the fact.  


The application of the monographic principle does not complete the documentation of facts 


The results of the natural sciences are grounded in millions of carefully observed, analysed, and catalogued facts. These facts have subsequently been integrated into sequences and the combination of these sequences has naturally led to the enunciation of laws

trimmed stone 

Monday, August 31, 2015

09/04/2015 Reading

1. The Big Toe
History of the big toe (or foot) in different countries, such as Spain and China.

2. What is Documentation?






Sunday, August 30, 2015

Chapter 2 Vocabulary

1. Postings skip pointers. For a postings list of length P, use √P evenly-spaced skip pointers. This heuristic can be improved upon; it ignores any details of the distribution of query terms.

2. Most recent search engines support a double quotes syntax (“stanford university”) for phrase queries, which has proven to be very easily understood and successfully used by users.
 
3. The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index.

4. For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed.

5. Let’s examine the space implications of having a positional index. A posting now needs an entry for each occurrence of a term. The index size thus depends on the average document size. The average web page has less than 1000 terms, but documents like SEC stock filings, books, and even some epic poems easily reach 100,000 terms. Consider a term with frequency 1 in 1000 terms on average. The result is that large documents cause an increase of two orders of magnitude in the space required to store the postings list:
Document size    Expected postings    Expected entries in positional posting
1000             1                    1
100,000          1                    100
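
Item 1 above (skip pointers) can be sketched as a postings-list intersection. The code below is a simplified illustration under my own assumptions: postings are sorted arrays of distinct document IDs, and instead of storing pointers, a skip is assumed to exist at every index that is a multiple of floor(√P). SkipIntersectSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipIntersectSketch {
    /** Skip length for a list of length n: about sqrt(n), at least 1. */
    static int skipLength(int n) {
        return Math.max(1, (int) Math.sqrt(n));
    }

    /** Moves index i forward in p, following skips while they do not overshoot target. */
    private static int advance(int[] p, int i, int step, int target) {
        // A skip pointer is assumed at every index that is a multiple of `step`.
        if (i % step == 0 && i + step < p.length && p[i + step] <= target) {
            while (i + step < p.length && p[i + step] <= target) {
                i += step;
            }
            return i;
        }
        return i + 1;
    }

    /** Intersects two sorted postings lists of distinct document IDs. */
    static List<Integer> intersect(int[] p1, int[] p2) {
        int s1 = skipLength(p1.length);
        int s2 = skipLength(p2.length);
        List<Integer> answer = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < p1.length && j < p2.length) {
            if (p1[i] == p2[j]) {
                answer.add(p1[i]);
                i++;
                j++;
            } else if (p1[i] < p2[j]) {
                i = advance(p1, i, s1, p2[j]);
            } else {
                j = advance(p2, j, s2, p1[i]);
            }
        }
        return answer;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 5, 8, 13, 21, 34, 55};
        int[] b = {2, 8, 34, 89};
        System.out.println(intersect(a, b)); // prints [2, 8, 34]
    }
}
```

A skip is taken only when the entry it lands on is still no larger than the other list's current document ID, so no match can be jumped over.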

Wednesday, August 26, 2015

How to give more persuasive presentations: A Q&A with Nancy Duarte


The speaker needs the audience more than the audience needs the speaker.
 Going from sounding memorized and canned to sounding natural is a lot of work.
 What is the best way to start creating a presentation?
My best advice is to not start in PowerPoint.

But on the stage, you have to move your body in really big gestures.  

Making presentations


Sunday, August 23, 2015

Keynote address: Library and information science as a research domain: problems and prospects


Most recently, LIS schools were accused by leaders of the profession of failing to educate students appropriately for the workplace and of engaging in esoteric and irrelevant research that was out of touch with real world needs.

Meanwhile, a community of information schools known as the "iSchool Caucus" has been founded. It has no affiliation with a professional association in LIS, yet it contains significant numbers of the leading LIS programs in North America.

LIS: two camps, the library side and the information side.

we might all agree generally that issues of information retrieval, information quality and authenticity, policy for access and preservation, the health and security applications of data mining, raise at least some big questions for information research to study.
  
The points made about IR can be made more or less equivalently, I'd argue, for many other of the current hot topics in information research. We are at the party, so to speak, but we are rarely the center of attention.  

What are the big questions?

It is important to remember that the values of LIS make it a potentially strong contributor to the debate and analysis of such issues.