Tuesday, November 21, 2006

Scalability Report on Triple Store Applications

This report examines a set of open source triple store systems suitable for The SIMILE Project's browser-like applications. Performance measurements within a common hardware, software, and dataset environment provide insight into which systems hold the most promise for acting as large, remote backing stores for SIMILE's future requirements.
The SIMILE Project (Semantic Interoperability of Metadata and Information in unLike Environments) is a joint research project between the World Wide Web Consortium (W3C), Hewlett-Packard Labs (HP), the Massachusetts Institute of Technology / Computer Science and Artificial Intelligence Laboratory (MIT / CSAIL), and MIT Libraries. Funding is provided by HP.

http://simile.mit.edu/reports/stores/

Thursday, November 16, 2006

The Experts Talk: Thirteen Great Ways to Increase Java Performance

1. Use buffered I/O. Using unbuffered I/O causes a lot of system calls for methods like InputStream.read(). This is common in code that parses input, such as commands from the network or configuration data from the disk.
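The fix is a one-line wrapper. A minimal sketch (the file path and byte-counting loop are invented for illustration), showing how BufferedInputStream turns one system call per byte into one per buffer fill:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedRead {
    // Count the bytes in a file one read() at a time. Without the
    // BufferedInputStream wrapper, every read() would be a system call;
    // with it, most calls are satisfied from the in-memory buffer.
    static long countBytes(String path) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(path));
        try {
            long n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } finally {
            in.close();
        }
    }
}
```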

2. Try to avoid new. Garbage collection is rarely a serious performance overhead. But Java virtual machine (JVM)-internal synchronization caused by the new operation can cause lock contention for applications with lots of threads. Sometimes new can be avoided by re-using byte arrays, or by re-using objects that have some notion of a state-resetting method.
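The state-resetting idea can be sketched as follows. The Request class and its fields are invented for this example; the point is that a worker can recycle the same instance (and its internal byte array) across uses instead of allocating a fresh one each time:

```java
// Hypothetical reusable request object. Instead of new-ing a Request
// (and a fresh byte array) per incoming message, callers fill, use,
// and then reset the same instance.
public class Request {
    private final byte[] buffer = new byte[4096]; // reused across requests
    private int length;

    void fill(byte[] data, int len) {
        System.arraycopy(data, 0, buffer, 0, len);
        length = len;
    }

    int length() { return length; }

    // Return the object to a blank state instead of allocating a new one.
    void reset() { length = 0; }
}
```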

3. Native methods are really fast. This sounds silly, but I had heard that the overhead of invoking a native method was so high that small Java methods might actually be faster. Not! In my test case, I implemented System.arraycopy in plain Java. I then compared this against System.arraycopy using arrays of different sizes. The native (original) method was about an order of magnitude faster, depending on the array size. The native-method overhead may be high, but native methods are still fast compared to interpreting byte code. If you can use native methods in the JDK, then you can remain 100% pure and have a faster implementation than if you used interpreted methods to accomplish the same thing.

4. String operations are fast. Using x + y (where x and y are strings) is faster than doing a getBytes of the two and then creating a new String from the byte array. However, String operations can hide a lot of new operations.

5. InetAddress.getHostAddress() has a lot of new operations. It creates a lot of intermediate strings to return the host address. Avoid it, if possible.

6. java.util.Date has some performance problems, particularly with internationalization. If you frequently print out the current time as something other than the (long ms-since-epoch) that it is usually represented as, you may be able to cache your representation of the current time and then create a separate thread to update that representation every N seconds (N depends on how accurately you need to represent the current time). You could also delay converting the time until a client needs it, and the current representation is known to be stale.
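The caching scheme described above can be sketched like this. The class name, format pattern, and refresh interval are invented for illustration: one daemon thread refreshes a formatted timestamp every N milliseconds, so hot paths read a cached String instead of formatting a Date on every call.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class CachedClock {
    private static final SimpleDateFormat FMT =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    private static volatile String now = format();

    // SimpleDateFormat is not thread-safe, so formatting is synchronized.
    private static synchronized String format() {
        return FMT.format(new Date());
    }

    public static String now() { return now; }

    // Refresh the cached representation every intervalMs milliseconds;
    // readers of now() never pay for formatting themselves.
    public static void startUpdater(final long intervalMs) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    now = format();
                    try { Thread.sleep(intervalMs); }
                    catch (InterruptedException e) { return; }
                }
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```

The interval controls the accuracy/cost trade-off the tip mentions: a one-second interval is plenty for log timestamps.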

7. Avoid java.lang.String.hashCode(). If the String's length exceeds 16 characters, hashCode() samples only a portion of the String. So if the places where a set of Strings differ don't get sampled, you can see lots of similar hash values. This can turn your hash tables into linked lists!

8. Architecture matters. In most applications, good performance comes from getting the architecture right. Using the right data structures for the problem you're solving is a lot more important than tweaking String operations. Thread architecture is also important. (Try to avoid wait/notify operations--they can cause a lot of lock contention in some VMs.) And of course you should use caching for your most expensive operations.

9. I have mixed feelings about java.util.Hashtable. It's nice to get so much functionality for free, but it is heavily synchronized. For instance, get() is a synchronized method. This means that the entire table is locked even while the hashCode() of the target key is computed.
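To make the trade-off concrete, a small sketch (the method names are invented): a plain HashMap read involves no lock at all, so it suits maps confined to one thread or guarded by the caller's own finer-grained locking, while Hashtable pays for its built-in monitor on every get().

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class MapChoice {
    // Unsynchronized: appropriate when a single thread owns the map,
    // or when the caller supplies its own locking around access.
    static Map<String, Integer> readMostly() {
        return new HashMap<String, Integer>();
    }

    // Fully synchronized: the whole table's monitor is taken for each
    // get(), including while the key's hashCode() is computed.
    static Map<String, Integer> sharedLegacy() {
        return new Hashtable<String, Integer>();
    }
}
```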

10. String.getBytes() takes about ten times as long as String.getBytes(int srcBegin, int srcEnd, byte dst[], int dstBegin). This is because the former does correct char-to-byte conversion, which involves a function call per character. The latter is deprecated, but you can get 10% faster than it without any deprecated methods using the following code:
static void getBytesFast() {
    String str = new String("the dark brown frog jumps the green tree");
    // alloc the buffer outside loop so
    // all methods do one new per iteration...
    char buffer[] = new char[str.length()];
    for (int i = 0; i < 10000; i++) {
        int length = str.length();
        str.getChars(0, length, buffer, 0);
        byte b[] = new byte[length];
        for (int j = 0; j < length; j++) {
            b[j] = (byte) buffer[j];
        }
    }
}
Note that this doesn't do proper char-to-byte conversion though.

11. Synchronized method invocation takes about six times as long as non-synchronized invocation. This hasn't been a problem in the Java Web Server, so we tend to break locks up into smaller locks to avoid lock contention. A lot of times people synchronize on an entire class for everything, even though the class contains variables that can be read/written concurrently without any loss of consistency. This calls for locking on the variables themselves, or creating dummy objects to serve as locks for them.
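The dummy-object-per-variable pattern can be sketched as follows (the Stats class and its two counters are invented for this example). The two counters never need to be consistent with each other, so each gets its own lock object instead of synchronizing every method on the whole instance:

```java
public class Stats {
    // One dummy lock object per independently-updated variable.
    private final Object hitLock = new Object();
    private final Object errorLock = new Object();
    private long hits;
    private long errors;

    // Threads recording hits never contend with threads recording errors.
    public void recordHit()   { synchronized (hitLock)   { hits++; } }
    public void recordError() { synchronized (errorLock) { errors++; } }

    public long hits()   { synchronized (hitLock)   { return hits; } }
    public long errors() { synchronized (errorLock) { return errors; } }
}
```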

12. Be careful about using lots of debugging code. A lot of people do something like the following:
debug("foobar: " + x + y + "afasdfasdf");
public static void debug(String s) {
// System.err.println(s);
}
Then they think that they've turned off the debugging overhead. Nope! If there are enough debugging statements, you can see a lot of time spent creating new strings to evaluate "foobar: " + x + y + "afasdfasdf", which is then tossed after calling debug.
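One common fix is to guard the call site with a compile-time constant, so the argument string is never built when debugging is off. A sketch (the class and field names are invented; the counter exists only to make the behavior observable):

```java
public class Debug {
    public static final boolean ON = false;
    static int calls = 0; // counts how often debug() actually runs

    public static void debug(String s) {
        calls++;
        System.err.println(s);
    }

    static void handle(String x, String y) {
        // Because ON is a constant false, the concatenation in the
        // branch is never evaluated -- no throwaway String is created.
        if (ON) {
            debug("foobar: " + x + y + "afasdfasdf");
        }
    }
}
```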

13. Profiles of the Java Web Server show it spending about 1-2% of its time running the garbage collector under most uses. So we rarely worry about performance of the garbage collector. The one thing you want to be careful about is response time. Running with a large heap size decreases the frequency of a garbage collection, but increases the hit taken when one occurs. The current VM pauses all your threads when a garbage collection occurs, so your users can see long pauses. Smaller heap sizes increase the frequency of garbage collections, but decrease their length.

http://java.sun.com/developer/technicalArticles/Programming/Performance/

Building and Managing a Massive Triple Store: An Experience Report

The aim of the Ingenta MetaStore project is to build a flexible and scalable repository for the storage of bibliographic metadata spanning 17 million articles and 20,000 publications.
The repository replaces several existing data stores and will act as a focal point for integration of a number of existing applications and future projects. Scalability, replication and robustness were important considerations in the repository design.
After introducing the benefits of using RDF as the data model for this repository, the paper will focus on the practical challenges involved in creating and managing a very large triple store.
The repository currently contains over 200 million triples from a range of vocabularies including FOAF, Dublin Core and PRISM.
The challenges faced range from schema design and data loading to SPARQL query performance. Load testing of the repository provided some insights into the tuning of SPARQL queries.
The paper will introduce the solutions developed to meet these challenges with the goal of helping others seeking to deploy a large triple store in a production environment. The paper will also suggest some avenues for further research and development.

http://xtech06.usefulinc.com/schedule/paper/18

Wednesday, November 15, 2006

Ontology Definition Metamodel (ODM) standard

The Ontology Definition Metamodel (ODM), as defined in this specification, is a family of MOF metamodels, mappings between those metamodels as well as mappings to and from UML, and a set of profiles that enable ontology modelling through the use of UML-based tools. The metamodels that comprise the ODM reflect the abstract syntax of several standard knowledge representation and conceptual modelling languages that have been recently adopted by other international standards bodies (e.g., RDF and OWL by the W3C), are in the process of being adopted (e.g., Common Logic and Topic Maps by ISO), or are considered industry de facto standards (covered in non-normative ER and DL appendices).

Monday, November 06, 2006

Was the Universal Service Registry a Dream? A combination of the features in UDDI and RDF may just make the dream come true

Automatic Mapping of OWL Ontologies into Java

An approach for mapping an OWL ontology into Java. The basic idea is to create a set of Java interfaces and classes from an OWL ontology such that an instance of a Java class represents an instance of a single class of the ontology, with most of its properties, class relationships and restriction definitions maintained. There exist some fundamental semantic differences between Description Logic (DL) and Object Oriented (OO) systems, primarily related to completeness and satisfiability. We present various ways in which we aim to minimize the impact of such differences, and show how to map a large part of the much richer OWL semantics into Java. Finally, we sketch the HarmonIA framework, which is used for the automatic generation of agent systems from institution specifications, and whose OWL Ontology Creation module was the basis for the tool presented in this paper.
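To make the basic idea concrete, here is a hand-written sketch of the kind of code such a generator might emit. This is illustrative only, not the paper's actual tool output: the ontology class (a Person with a name datatype property and a knows object property) and all Java names are invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

// What a generator might produce for a hypothetical OWL class ex:Person:
// an interface plus a default implementation. Datatype properties become
// typed getters/setters; object properties become typed collections.
interface Person {
    String getName();
    void setName(String name);
    Set<Person> getKnows();
    void addKnows(Person p);
}

class PersonImpl implements Person {
    private String name;
    private final Set<Person> knows = new HashSet<Person>();

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Set<Person> getKnows() { return knows; }
    public void addKnows(Person p) { knows.add(p); }
}
```

Note the DL/OO gap the abstract mentions shows up even here: OWL would happily allow several values for name, while the Java field holds exactly one.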

more on: http://www.mindswap.org/~aditkal/SEKE04.pdf

Using RDF with SOAP.

This article examines ways that SOAP can be used to communicate information in RDF models. It discusses ways of translating the fundamental data in RDF models to the SOAP encoding for RPC-like exchange, or for directly passing parts of the model in RDF/XML serialized form.
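The second approach can be sketched with plain string assembly (the class name, sample model, and payload are invented for illustration; a real service would serialize an actual model rather than hard-code the RDF/XML):

```java
public class SoapRdf {
    // Wrap an RDF/XML fragment literally inside a SOAP 1.1 body.
    static String envelope(String rdfXml) {
        return "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
             + "  <soap:Body>\n"
             + rdfXml + "\n"
             + "  </soap:Body>\n"
             + "</soap:Envelope>";
    }

    // A toy RDF/XML payload: one resource with a Dublin Core title.
    static String sampleModel() {
        return "    <rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n"
             + "             xmlns:dc=\"http://purl.org/dc/elements/1.1/\">\n"
             + "      <rdf:Description rdf:about=\"http://example.org/doc\">\n"
             + "        <dc:title>An example resource</dc:title>\n"
             + "      </rdf:Description>\n"
             + "    </rdf:RDF>";
    }
}
```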

More on: http://www-128.ibm.com/developerworks/webservices/library/ws-soaprdf/

Saturday, November 04, 2006

RDFReactor

RDFReactor views the RDF data model through object-oriented Java proxies. It makes using RDF easy for Java developers.

Features
Think in objects, not statements
Read and write RDF data using familiar Java objects: use person.setName("Max Mustermann") instead of addTriple( personURI, nameURI, "Max Mustermann" )
Dynamic state
all state information is at all times only in the RDF model in the triple store of your choice
Java interfaces are generated automatically from an RDF Schema
Mapping from RDF Schema to Java fully customizable
Thanks to Jena it reads: RDF/XML, N3 or NT syntax
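The "all state lives in the triple store" idea can be illustrated with a toy proxy. This is not RDFReactor's real generated API (which sits on Jena); the class, the in-memory triple list, and everything except the FOAF name predicate URI are invented to show the pattern: the proxy holds no field state, and every getter and setter reads or writes the underlying model.

```java
import java.util.List;

public class PersonProxy {
    static final String NAME = "http://xmlns.com/foaf/0.1/name";

    // Toy stand-in for a real triple store.
    static class Triple {
        final String s, p, o;
        Triple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
    }

    private final List<Triple> store;
    private final String uri;

    PersonProxy(List<Triple> store, String uri) {
        this.store = store;
        this.uri = uri;
    }

    // setName(...) instead of addTriple(personURI, nameURI, ...):
    // replace any existing name triple for this subject.
    void setName(String name) {
        store.removeIf(t -> t.s.equals(uri) && t.p.equals(NAME));
        store.add(new Triple(uri, NAME, name));
    }

    // Reads go straight to the store too -- the proxy caches nothing.
    String getName() {
        for (Triple t : store)
            if (t.s.equals(uri) && t.p.equals(NAME)) return t.o;
        return null;
    }
}
```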

http://rdfreactor.ontoware.org/

JRDF - Java RDF

JRDF is an attempt to create a standard set of APIs and base implementations for RDF (Resource Description Framework) using Java. The API will cover anything that is deemed to be useful for Java programmers. A key aspect will be to ensure a high degree of modularity and to follow standard Java conventions. It will be similar to other standard Java APIs such as JDBC, XML DOM, Collections, etc.

This project is based on the existing RDF libraries and is designed to include the best features from all available sources.

http://jrdf.sourceforge.net/