Inspired by Actual Events: Java 7

Showing posts with label Java 7. Show all posts

Friday, August 17, 2018

Carefully Specify Multiple Resources in Single try-with-resources Statement

One of the more useful new features of Java 7 was the introduction of the try-with-resources statement [AKA Automatic Resource Management (ARM)]. The attractiveness of the try-with-resources statement lies in its promise to "ensure that each resource is closed at the end of the statement." A "resource" in this context is any class that implements AutoCloseable and its close() method and is instantiated inside the "try" clause of the try-with-resources statement.

The Java Language Specification [JLS] describes the try-with-resource statement in detail in Section 14.20.3 (of Java SE 10 JLS in this case). The JLS states that the "try-with-resources statement is parameterized with local variables (known as resources) that are initialized before execution of the try block and closed automatically, in the reverse order from which they were initialized, after execution of the try block."

The JLS clearly specifies that multiple resources can be defined in relation to a single try-with-resources statement and it specifies how multiple resources are specified. Specifically it indicates that try can be followed by a "ResourceSpecification" that is composed of a "ResourceList" that is composed of one or more "Resource"s. When there is more than a single declared resource, the multiple resources are delimited by semicolon (;). This specification of multiple resources in semicolon-delimited list is important because any candidate resources not declared in this manner will not be supported (will not be closed automatically) by the try-with-resources statement.

The most likely source of errors when specifying multiple resources in a try-with-resources statement is "nesting" instantiations of "resources" instead of explicitly instantiating local variables of each of them separately with semicolons between each instantiation. Examples that follow will illustrate the difference.

Two ridiculous but illustrative classes are shown next. Each class implements AutoCloseable and so can be used in conjunction with try-with-resources and will have its close() method called automatically when used correctly with the try-with-resources statement. They are named to reflect that the OuterResource can be instantiated with an instance of the InnerResource.

InnerResource.java

package dustin.examples.exceptions;

import static java.lang.System.out;

public class InnerResource implements AutoCloseable
{
   public InnerResource()
   {
      out.println("InnerResource created.");
   }

   public InnerResource(
      final RuntimeException exceptionToThrow)
   {
      throw  exceptionToThrow != null
         ? exceptionToThrow
         : new RuntimeException("InnerResource: No exception provided.");
   }

   @Override
   public void close() throws Exception
   {
      out.println("InnerResource closed.");
   }

   @Override
   public String toString()
   {
      return "InnerResource";
   }
}

OuterResource.java

package dustin.examples.exceptions;

import static java.lang.System.out;

public class OuterResource implements AutoCloseable
{
   private final InnerResource wrappedInnerResource;

   public OuterResource(final InnerResource newInnerResource)
   {
      out.println("OuterResource created.");
      wrappedInnerResource = newInnerResource;
   }

   public OuterResource(
      final InnerResource newInnerResource,
      final RuntimeException exceptionToThrow)
   {
      wrappedInnerResource = newInnerResource;
      throw  exceptionToThrow != null
           ? exceptionToThrow
           : new RuntimeException("OuterResource: No exception provided.");
   }

   @Override
   public void close() throws Exception
   {
      out.println("OuterResource closed.");
   }

   @Override
   public String toString()
   {
      return "OuterResource";
   }
}

The two classes just defined can now be used to demonstrate the difference between correctly declaring instances of each in the same try-with-resources statement in a semicolon-delimited list and incorrectly nesting instantiation of the inner resource within the constructor of the outer resource. The latter approach doesn't work as well as hoped because the inner resource without a locally defined variable is not treated as a "resource" in terms of invoking its AutoCloseable.close() method.

The next code listing demonstrates the incorrect approach for instantiating "resources" in the try-with-resources statement.

Incorrect Approach for Instantiating Resources in try-with-resources Statement

try (OuterResource outer = new OuterResource(
        new InnerResource(), new RuntimeException("OUTER")))
{
   out.println(outer);
}
catch (Exception exception)
{
   out.println("ERROR: " + exception);
}

When the code above is executed, the output "InnerResource created." is seen, but no output is ever presented related to the resource's closure. This is because the instance of InnerResource was instantiated within the call to the constructor of the OuterResource class and was never assigned to its own separate variable in the resource list of the try-with-resource statement. With a real resource, the implication of this is that the resource is not closed properly.

The next code listing demonstrates the correct approach for instantiating "resources" in the try-with-resources statement.

Correct Approach for Instantiating Resources in try-with-resources Statement

try(InnerResource inner = new InnerResource();
    OuterResource outer = new OuterResource(inner, new RuntimeException("OUTER")))
{
   out.println(outer);
}
catch (Exception exception)
{
   out.println("ERROR: " + exception);
}

When the code immediately above is executed, the output includes both "InnerResource created." AND "InnerResource closed." because the InnerResource instance was properly assigned to a variable within the try-with-resources statement and so its close() method is properly called even when an exception occurs during its instantiation.

The try-with-resources Statement section of the Java Tutorials includes a couple of examples of correctly specifying the resources in the try-with-resources as semicolon-delimited individual variable definitions. One example shows this correct approach with java.util.zip.ZipFile and java.io.BufferedWriter and another example shows this correct approach with an instances of java.sql.Statement and java.sql.ResultSet.

The introduction of try-with-resources in JDK 7 was a welcome addition to the language that it made it easier for Java developers to write resource-safe applications that were not as likely to leak or waste resources. However, when multiple resources are declared within a single try-with-resources statement, it's important to ensure that each resource is individually instantiated and assigned to its own variable declared within the try's resource specifier list to ensure that each and every resource is properly closed. A quick way to check this is to ensure that for n AutoCloseableimplementing resources specified in the try, there should be n-1 semicolons separating those instantiated resources.

Saturday, February 24, 2018

Prefer System.lineSeparator() for Writing System-Dependent Line Separator Strings in Java

JDK 7 introduced a new method on the java.lang.System class called lineSeparator(). This method does not expect any arguments and returns a String that represents "the system-dependent line separator string." The Javadoc documentation for this method also states that System.lineSeparator() "always returns the same value - the initial value of the system property line.separator." It further explains, "On UNIX systems, it returns "\n"; on Microsoft Windows systems it returns "\r\n"."

Given that a Java developer has long been able to use System.getProperty("line.separator") to get this system-dependent line separator value, why would that same Java developer now favor using System.lineSeparator instead? JDK-8198645 ["Use System.lineSeparator() instead of getProperty("line.separator")"] provides a couple reasons for favoring System.lineSeparator() over the System.getProperty(String) approach in its "Description":

A number of classes in the base module use System.getProperty("line.separator") and could use the more efficient System.lineSeparator() to simplify the code and improve performance.

As the "Description" in JDK-8198645 states, use of System.lineSeparator() is simpler to use and more efficient than System.getProperty("line.separator"). A recent message on the core-libs-dev mailing list provides more details and Roger Riggs writes in that message that System.lineSeparator() "uses the line separator from System instead of looking it up in the properties each time."

The performance benefit of using System.lineSeparator() over using System.getProperty("line.separator") is probably not all that significant in many cases. However, given its simplicity, there's no reason not to gain a performance benefit (even if tiny and difficult to measure in many cases) while writing simpler code. One of the drawbacks to the System.getProperty(String) approach is that one has to ensure that the exactly matching property name is provided to that method. With String-based APIs, there's always a risk of spelling the string wrong (I have seen "separator" misspelled numerous times as "seperator"), using the wrong case, or accidentally introducing other typos that prevent exact matches from being made.

The JDK issue that introduced this feature to JDK 7, JDK-6900043 ("Add method to return line.separator property"), also spells out some benefits in its "Description": "Querying the line.separator value is a common occurrence in large systems. Doing this properly is verbose and involves possible security failures; having a method return this value would be beneficial." Duplicate JDK-6264243 ("File.lineSeparator() to retrieve value of commonly used 'line.separator' system property") spells out advantages of this approach with even more detail and lists "correctness", "performance", and "ease of use and cross-platform development" as high-level advantages. Another duplicate issue, JDK-6529790 ("Please add LINE_SEPARATOR constant into System or some other class"), makes the point that there should be a "constant" added to "some standard Java class, as String or System," in a manner similar to that provided for file separators by File.pathSeparator.

One of the messages associated with the JDK 7 introduction of System.lineSeparator() justifies it additions with this description:

Lots of classes need to use System.getProperty("line.separator"). Many don't do it right because you need to use a doPrivileged block whenever you read a system property. Yet it is no secret - you can divine the line separator even if you have no trust with the security manager.

An interesting side note related to the addition of System.lineSeparator() in JDK 7 is that the Javadoc at that time did not indicate that the method was new to JDK 7. JDK-7082231 ("Put a @since 1.7 on System.lineSeparator") addressed this in JDK 8 and two other JDK issues (JDK-8011796 and JDK-7094275) indicate that this was desired by multiple Java developers.

The introduction of System.lineSeparator() was a very small enhancement, but it does improve the safety and readability of a relatively commonly used API while not reducing (and, in fact, while improving) performance.

Monday, February 29, 2016

jcmd: One JDK Command-Line Tool to Rule Them All

I have referenced the handy JDK tool jcmd in several posts in the past, but focus exclusively on its usefulness here like I have previously done for jps. The jcmd tool was introduced with Oracle's Java 7 and is particularly useful in troubleshooting issues with JVM applications by using it to identify Java processes' IDs (akin to jps), acquiring heap dumps (akin to jmap), acquiring thread dumps (akin to jstack), viewing virtual machine characteristics such as system properties and command-line flags (akin to jinfo), and acquiring garbage collection statistics (akin to jstat). The jcmd tool has been called "a swiss-army knife for investigating and resolving issues with your JVM application" and a "hidden gem."

When using most JDK command-line tools (including jcmd), it's often important to identify the process ID (pid) of the Java process for which we want to use the command-line tool. This is easily accomplished with jcmd by simply running the command without any arguments as shown in the next screen snapshot.

Running jcmd without arguments in the example above shows two Java processes running (jcmd itself with a pid of 324 and another Java process with a pid of 7268). Note that although jcmd works very much like jps when it comes to listing Java processes, jcmd lists more information than jps does without arguments -lm.

Running jcmd -h shows help and usage information for jcmd as demonstrated in the next screen snapshot.

The help explains, as was just shown, that jcmd "lists Java processes" when "no options are given." The help also states that this is behavior similar to running jcmd -p, but I think it means to say running jcmd without options is equivalent to running jcmd -l, which is shown in the next screen snapshot.

As when jcmd was run without any options, jcmd -l lists Java processes and their respective pids. The pids are different in this example because it's a different execution of jcmd and I have a different Java process running this time.

Running jcmd -h showed relatively few options. To see help on the many capabilities that jcmd supports, one needs to ask jcmd which capabilities are supported for a particular Java process. The next screen snapshot illustrates this. I first run jcmd without options to discover the pid of the Java process of interest (6320 in this case). Then, I am able to run jcmd 6320 help to see which commands jcmd supports.

The previous screen snapshot demonstrates the commands jcmd supports for the particular Java VM identified by the pid. Specifically, it states, "The following commands are available:" and then lists them:

JFR.stop
JFR.start
JFR.dump
JFR.check
VM.native_memory
VM.check_commercial_features
VM.unlock_commercial_features
ManagementAgent.stop
ManagementAgent.start_local
ManagementAgent.start
GC.rotate_log
GC.class_stats
GC.class_histogram
GC.heap_dump
GC.run_finalization
GC.run
Thread.print
VM.uptime
VM.flags
VM.system_properties
VM.command_line
VM.version
help

When jcmd <pid> help is run against a pid for a different Java VM process, it's possible to get a different list of available commands. This is illustrated in the next screen snapshot when jcmd 1216 help is executed against that process with pid of 1216.

By comparing the last two screen snapshots, it becomes clear that jcmd supports different commands for different Java VM instances. This is why the supported commands for a particular VM are listed by specifying the pid in the help command. Some of the commands available against the second VM (pid 1216 in this case) that were not listed for the originally checked VM include the following:

VM.log
ManagementAgent.status
Compiler.directives_clear
Compiler.directives_remove
Compiler.directives_add
Compiler.directives_print
VM.print_touched_methods
Compiler.codecache
Compiler.codelist
Compiler.queue
VM.classloader_stats
JVMTI.data_dump
VM.stringtable
VM.symboltable
VM.class_hierarchy
GC.finalizer_info
GC.heap_info
VM.info
VM.dynlibs
VM.set_flag

This "help" also advises, "For more information about a specific command use 'help <command>'." Doing this is illustrated in the next screen snapshot specifically for jcmd's Thread.print

While on the subject of jcmd Thread.print command, it's a good time to illustrate using this to see thread stacks of Java processes. The next screen snapshot shows the beginning of the much lengthier results seen when jcmd <pid> Thread.print is executed (in this case for the Java process with pid 6320).

There are several VM.* commands supported by jcmd: VM.version, VM.uptime, VM.command_line, VM.flags, VM.system_properties, VM.native_memory, and VM.classloader_stats. The next screen snapshot illustrates use of jcmd <pid> VM.version and jcmd <pid> VM.uptime for the Java process with pid 6320.

The next screen snapshot demonstrates execution of jcmd <pid> VM.command_line against process with pid 6320.

From this screen snapshot which shows the top portion of the output from running jcmd 6320 VM.command_line, we can see from the JVM command-line arguments that were provided to this process that it's a NetBeans-related process. Running the command jcmd <pid> VM.flags against Java process with pid 6320 shows the HotSpot options passed to that process.

The system properties used by a Java process can be listed using jcmd <pid> VM.system_properties and this is illustrated in the next screen snapshot.

When one attempts to run jcmd <pid> VM.native_memory against a Java process that hasn't had Native Memory Tracking (NMT) enabled, the error message "Native Memory Tracking is not enabled" is printed as shown in the next screen snapshot.

To use the command jcmd <pid> VM.native_memory, the JVM (java process) to be measured should be started with either the -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail options. Once the VM has been started with either of those options, the commands jcmd <pid> VM.native_memory baseline and then jcmd <pid> VM.native_memory detail.diff can be executed against that JVM process.

The command jcmd <pid> VM.classloader_stats provides insight into the classloader. This is shown in the next screen snapshot against Java process with pid 1216:

jcmd <pid> VM.class_hierarchy is an interesting command that prints the hierarchy of the classes loaded in the targeted Java VM process.

jcmd <pid> VM.dynlibs can be used to view dynamic libraries information. This is demonstrated in the next screen snapshot when executed against Java process with pid 1216.

The jcmd <pid> VM.info lists a lot of information regarding the targeted Java VM process including a VM summary and information about the process, garbage collection events, dynamic libraries, arguments provided to the VM, and some of the characteristics of the host machine. Just a small part of the beginning of the output of this is demonstrated in the next screen snapshot for jcmd 1216 VM.info:

The next screen snapshot demonstrates use of jcmd <pid> VM.stringtable and jcmd <pid> VM.symboltable:

Use of jcmd <pid> Compiler.directives_print is demonstrated in the next screen snapshot.

Several commands supported by jcmd support managing and monitoring garbage collection. Two of these are jcmd <pid> GC.run [similar to System.gc()] and jcmd <pid> GC.run_finalization [similar to System.runFinalization()]. The two of these are demonstrated in the next screen snapshot.

The command jcmd <pid> GC.class_histogram provides a handy way to view an object histogram as shown in the next screen snapshot.

jcmd can be used to generate a heap dump against a running Java VM with jcmd <pid> GC.heap_dump <filename> and this is demonstrated in the next screen snapshot.

The jhat command can now be used to process the heap dump generated by jcmd as shown in the next two screen snapshots.

There are some jcmd commands that only work against Java VMs that were started using the -XX:+UnlockDiagnosticVMOptions JVM flag. The next screen snapshot demonstrates what happens when I try to run jcmd <pid> GC.class_stats against a Java VM that wasn't started with the flag -XX:+UnlockDiagnosticVMOptions.

When the targeted VM is started with -XX:+UnlockDiagnosticVMOptions, jcmd <pid> GC.class_stats displays "statistics about Java class metadata."

This post has covered several of the commands provided by jcmd, but has not covered the functionality related to Java Flight Recorder [JFR] (commands with names starting with JFR.*), related to checking and enabling commercial features (jcmd <pid> VM.check_commercial_features and jcmd <pid> VM.unlock_commercial_features), and related to JMX-based management (ManagementAgent.start, ManagementAgent.start_local, ManagementAgent.stop, and ManagementAgent.status).

Relationship of jcmd to Other Command-line JDK Tools

Functionality	jcmd	Similar Tool
Listing Java Processes	`jcmd`	`jps -lm`
Heap Dumps	`jcmd <pid> GC.heap_dump`	`jmap -dump <pid>`
Heap Usage Histogram	`jcmd <pid> GC.class_histogram`	`jmap -histo <pid>`
Thread Dump	`jcmd <pid> Thread.print`	`jstack <pid>`
List System Properties	`jcmd <pid> VM.system_properties`	`jinfo -sysprops <pid>`
List VM Flags	`jcmd <pid> VM.flags`	`jinfo -flags <pid>`

Conclusion

In one command-line tool, jcmd brings together the functionality of several command-line JDK tools. This post has demonstrated several of the functions provided by jcmd.

My Other Blog Posts Referencing jcmd

Wednesday, February 18, 2015

Determining File Types in Java

Programmatically determining the type of a file can be surprisingly tricky and there have been many content-based file identification approaches proposed and implemented. There are several implementations available in Java for detecting file types and most of them are largely or solely based on files' extensions. This post looks at some of the most commonly available implementations of file type detection in Java.

Several approaches to identifying file types in Java are demonstrated in this post. Each approach is briefly described, illustrated with a code listing, and then associated with output that demonstrates how different common files are typed based on extensions. Some of the approaches are configurable, but all examples shown here use "default" mappings as provided out-of-the-box unless otherwise stated.

About the Examples

The screen snapshots shown in this post are of each listed code snippet run against certain subject files created to test the different implementations of file type detection in Java. Before covering these approaches and demonstrating the type each approach detects, I list the files under test and what they are named and what they really are.

File Name	File Extension	File Type	Type Matches Extension Convention?
actualXml.xml	xml	XML	Yes
blogPostPDF		PDF	No
blogPost.pdf	pdf	PDF	Yes
blogPost.gif	gif	GIF	Yes
blogPost.jpg	jpg	JPEG	Yes
blogPost.png	png	PNG	Yes
blogPostPDF.txt	txt	PDF	No
blogPostPDF.xml	xml	PDF	No
blogPostPNG.gif	gif	PNG	No
blogPostPNG.jpg	jpg	PNG	No
dustin.txt	txt	Text	Yes
dustin.xml	xml	Text	No
dustin		Text	No

Files.probeContentType(Path) [JDK 7]

Java SE 7 introduced the highly utilitarian Files class and that class's Javadoc succinctly describes its use: "This class consists exclusively of static methods that operate on files, directories, or other types of files" and, "in most cases, the methods defined here will delegate to the associated file system provider to perform the file operations."

The java.nio.file.Files class provides the method probeContentType(Path) that "probes the content type of a file" through use of "the installed FileTypeDetector implementations" (the Javadoc also notes that "a given invocation of the Java virtual machine maintains a system-wide list of file type detectors").

/**
 * Identify file type of file with provided path and name
 * using JDK 7's Files.probeContentType(Path).
 *
 * @param fileName Name of file whose type is desired.
 * @return String representing identified type of file with provided name.
 */
public String identifyFileTypeUsingFilesProbeContentType(final String fileName)
{
   String fileType = "Undetermined";
   final File file = new File(fileName);
   try
   {
      fileType = Files.probeContentType(file.toPath());
   }
   catch (IOException ioException)
   {
      out.println(
           "ERROR: Unable to determine file type for " + fileName
              + " due to exception " + ioException);
   }
   return fileType;
}

When the above Files.probeContentType(Path)-based approach is executed against the set of files previously defined, the output appears as shown in the next screen snapshot.

The screen snapshot indicates that the default behavior for Files.probeContentType(Path) on my JVM seems to be tightly coupled to the file extension. The files with no extensions show "null" for file type and the other listed file types match the files' extensions rather than their actual content. For example, all three files with names starting with "dustin" are really the same single-sentence text file, but Files.probeContentType(Path) states that they are each a different type and the listed types are tightly correlated with the different file extensions for essentially the same text file.

MimetypesFileTypeMap.getContentType(String) [JDK 6]

The class MimetypesFileTypeMap was introduced with Java SE 6 to provide "data typing of files via their file extension" using "the .mime.types format." The class's Javadoc explains where in a given system the class looks for MIME types file entries. My example uses the ones that come out-of-the-box with my JDK 8 installation. The next code listing demonstrates use of javax.activation.MimetypesFileTypeMap.

/**
 * Identify file type of file with provided name using
 * JDK 6's MimetypesFileTypeMap.
 *
 * See Javadoc documentation for MimetypesFileTypeMap class
 * (http://docs.oracle.com/javase/8/docs/api/javax/activation/MimetypesFileTypeMap.html)
 * for details on how to configure mapping of file types or extensions.
 */
public String identifyFileTypeUsingMimetypesFileTypeMap(final String fileName)
{    
   final MimetypesFileTypeMap fileTypeMap = new MimetypesFileTypeMap();
   return fileTypeMap.getContentType(fileName);
}

The next screen snapshot demonstrates the output from running this example against the set of test files.

This output indicates that the MimetypesFileTypeMap approach returns the MIME type of application/octet-stream for several files including the XML files and the text files without a .txt suffix. We see also that, like the previously discussed approach, this approach in some cases uses the file's extension to determine the file type and so incorrectly reports the file's actual file type when that type is different than what its extension conventionally implies.

URLConnection.getContentType()

I will be covering three methods in URLConnection that support file type detection. The first is URLConnection.getContentType(), a method that "returns the value of the content-type header field." Use of this instance method is demonstrated in the next code listing and the output from running that code against the common test files is shown after the code listing.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.getContentType().
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGetContentType(final String fileName)
{
   String fileType = "Undetermined";
   try
   {
      final URL url = new URL("file://" + fileName);
      final URLConnection connection = url.openConnection();
      fileType = connection.getContentType();
   }
   catch (MalformedURLException badUrlEx)
   {
      out.println("ERROR: Bad URL - " + badUrlEx);
   }
   catch (IOException ioEx)
   {
      out.println("Cannot access URLConnection - " + ioEx);
   }
   return fileType;
}

The file detection approach using URLConnection.getContentType() is highly coupled to files' extensions rather than the actual file type. When there is no extension, the String returned is "content/unknown."

URLConnection.guessContentTypeFromName(String)

The second file detection approach provided by URLConnection that I'll cover here is its method guessContentTypeFromName(String). Use of this static method is demonstrated in the next code listing and associated output screen snapshot.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromName(String).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromName(final String fileName)
{
   return URLConnection.guessContentTypeFromName(fileName);
}

URLConnection's guessContentTypeFromName(String) approach to file detection shows "null" for files without file extensions and otherwise returns file type String representations that closely mirror the files' extensions. These results are very similar to those provided by the Files.probeContentType(Path) approach shown earlier with the one notable difference being that URLConnection's guessContentTypeFromName(String) approach identifies files with .xml extension as being of file type "application/xml" while Files.probeContentType(Path) identifies these same files' types as "text/xml".

URLConnection.guessContentTypeFromStream(InputStream)

The third approach I cover that is provided by URLConnection for file type detection is via the class's static method guessContentTypeFromStream(InputStream). A code listing employing this approach and associated output in a screen snapshot are shown next.

/**
 * Identify file type of file with provided path and name
 * using JDK's URLConnection.guessContentTypeFromStream(InputStream).
 *
 * @param fileName Name of file whose type is desired.
 * @return Type of file for which name was provided.
 */
public String identifyFileTypeUsingUrlConnectionGuessContentTypeFromStream(final String fileName)
{
   String fileType;
   try
   {
      fileType = URLConnection.guessContentTypeFromStream(new FileInputStream(new File(fileName)));
   }
   catch (IOException ex)
   {
      out.println("ERROR: Unable to process file type for " + fileName + " - " + ex);
      fileType = "null";
   }
   return fileType;
}

All the file types are null! The reason for this appears to be explained by the Javadoc for the InputStream parameter of the URLConnection.guessContentTypeFromStream(InputStream) method: "an input stream that supports marks." It turns out that the instances of FileInputStream in my examples do not support marks (their calls to markSupported() all return false).

Apache Tika

All of the examples of file detection covered in this post so far have been approaches provided by the JDK. There are third-party libraries that can also be used to detect file types in Java. One example is Apache Tika, a "content analysis toolkit" that "detects and extracts metadata and text from over a thousand different file types." In this post, I look at using Tika's facade class and its detect(String) method to detect file types. The instance method call is the same in the three examples I show, but the results are different because each instance of the Tika facade class is instantiated with a different Detector.

The instantiations of Tika instances with different Detectors is shown in the next code listing.

/** Instance of Tika facade class with default configuration. */
private final Tika defaultTika = new Tika();

/** Instance of Tika facade class with MimeTypes detector. */
private final Tika mimeTika = new Tika(new MimeTypes());
his is 
/** Instance of Tika facade class with Type detector. */
private final Tika typeTika = new Tika(new TypeDetector());

With these three instances of Tika instantiated with their respective Detectors, we can call the detect(String) method on each instance for the set of test files. The code for this is shown next.

/**
 * Identify file type of file with provided name using
 * Tika's default configuration.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingDefaultTika(final String fileName)
{
   return defaultTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika's with a MimeTypes detector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingMimeTypesTika(final String fileName)
{
   return mimeTika.detect(fileName);
}

/**
 * Identify file type of file with provided name using
 * Tika's with a Types detector.
 *
 * @param fileName Name of file for which file type is desired.
 * @return Type of file for which file name was provided.
 */
public String identifyFileTypeUsingTypeDetectorTika(final String fileName)
{
   return typeTika.detect(fileName);
}

When the three above Tika detection examples are executed against the same set of files are used in the previous examples, the output appears as shown in the next screen snapshot.

We can see from the output that the default Tika detector reports file types similarly to some of the other approaches shown earlier in this post (very tightly tied to the file's extension). The other two demonstrated detectors state that the file type is application/octet-stream in most cases. Because I called the overloaded version of detect(-) that accepts a String, the file type detection is "based on known file name extensions."

If the overloaded detect(File) method is used instead of detect(String), the identified file type results are much better than the previous Tika examples and the previous JDK examples. In fact, the "fake" extensions don't fool the detectors as much and the default Tika detector is especially good in my examples at identifying the appropriate file type even when the extension is not the normal one associated with that file type. The code for using Tika.detect(File) and the associated output are shown next.

   /**
    * Identify file type of file with provided name using
    * Tika's default configuration.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingDefaultTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = defaultTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika's with a MimeTypes detector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingMimeTypesTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = mimeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

   /**
    * Identify file type of file with provided name using
    * Tika's with a Types detector.
    *
    * @param fileName Name of file for which file type is desired.
    * @return Type of file for which file name was provided.
    */
   public String identifyFileTypeUsingTypeDetectorTikaForFile(final String fileName)
   {
      String fileType;
      try
      {
         final File file = new File(fileName);
         fileType = typeTika.detect(file);
      }
      catch (IOException ioEx)
      {
         out.println("Unable to detect type of file " + fileName + " - " + ioEx);
         fileType = "Unknown";
      }
      return fileType;
   }

Caveats and Customization

File type detection is not a trivial feat to pull off. The Java approaches for file detection demonstrated in this post provide basic approaches to file detection that are often highly dependent on a file name's extension. If files are named with conventional extensions that are recognized by the file detection approach, these approaches are typically sufficient. However, if unconventional file type extensions are used or the extensions are for files with types other than that conventionally associated with that extension, most of these approaches to file detection break down without customization. Fortunately, most of these approaches provide the ability to customize the mapping of file extensions to file types. The Tika approach using Tika.detect(File) was generally the most accurate in the examples shown in this post when the extensions were not the conventional ones for the particular file types.

Conclusion

There are numerous mechanisms available for simple file type detection in Java. This post reviewed some of the standard JDK approaches for file detection and some examples of using Tika for file detection.

Monday, January 26, 2015

Reading Large Lines Slower in JDK 7 and JDK 8

I recently ran into a case where a particular task (LineContainsRegExp) in an Apache Ant build file ran considerably slower in JDK 7 and JDK 8 than it did in JDK 6for extremely long character lines. Based on a simple example adapted from the Java code used by the LineContainsRegExp task, I was able to determine that the slowness has nothing to do with the regular expression, but rather has to do with reading characters from a file. The remainder of the post demonstrates this.

For my simple test, I first wrote a small Java class to write out a file that includes a line with as many characters as specified on the command line. The simple class, FileMaker, is shown next:

FileMaker.java

import static java.lang.System.out;

import java.io.FileWriter;

/**
 * Writes a file with a line that contains the number of characters provided.
 */
public class FileMaker
{
   /**
    * Create a file with a line that has the number of characters specified.
    *
    * @param arguments Command-line arguments where the first argument is the
    *    name of the file to be written and the second argument is the number
    *   of characters to be written on a single line in the output file.
    */
   public static void main(final String[] arguments)
   {
      if (arguments.length > 1)
      {
         final String fileName = arguments[0];
         final int maxRowSize = Integer.parseInt(arguments[1]);
         try
         {
            final FileWriter fileWriter = new FileWriter(fileName);
            for (int count = 0; count < maxRowSize; count++)
            {
               fileWriter.write('.');
            }
            fileWriter.flush();
         }
         catch (Exception ex)
         {
            out.println("ERROR: Cannot write file '" + fileName + "': " + ex.toString());
         }
      }
      else
      {
         out.println("USAGE: java FileMaker <fileName> <maxRowSize>");
         System.exit(-1);
      }
   }
}

The above Java class exists solely to generate a file with a line that has as many characters as specified (actually one more than specified when the \n is counted). The next class actually demonstrates the difference between the runtime behavior between Java 6 and Java 7. The code for this Main class is adapted from Ant classes that help perform the file reading functionality used by LineContainsRegExp without the regular expression matching. In other words, the regular expression support is not included in my example, but this class executes much more quickly for very large lines when run in Java 6 than when run in Java 7 or Java 8.

Main.java

import static java.lang.System.out;

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.concurrent.TimeUnit;

/**
 * Adapted from and intended to represent the basic character reading from file
 * used by the Apache Ant class org.apache.tools.ant.filters.LineContainsRegExp.
 */
public class Main
{
   private Reader in;
   private String line;

   public Main(final String nameOfFile)
   {
      if (nameOfFile == null || nameOfFile.isEmpty())
      {
         throw new IllegalArgumentException("ERROR: No file name provided.");
      }
      try
      {
         in = new FileReader(nameOfFile);
      }
      catch (Exception ex)
      {
         out.println("ERROR: " + ex.toString());
         System.exit(-1);
      }
   }
   

   /**
    * Read a line of characters through '\n' or end of stream and return that
    * line of characters with '\n'; adapted from readLine() method of Apache Ant
    * class org.apache.tools.ant.filters.BaseFilterReader.
    */
   protected final String readLine() throws IOException
   {
      int ch = in.read();

      if (ch == -1)
      {
         return null;
      }
        
      final StringBuilder line = new StringBuilder();

      while (ch != -1)
      {
         line.append ((char) ch);
         if (ch == '\n')
         {
            break;
         }
         ch = in.read();
      }

      return line.toString();
   }

   /**
    * Provides the next character in the stream; adapted from the method read()
    * in the Apache Ant class org.apache.tools.ant.filters.LineContainsRegExp.
    */
   public int read() throws IOException
   {
      int ch = -1;
 
      if (line != null)
      {
         ch = line.charAt(0);
         if (line.length() == 1)
         {
            line = null;
         }
         else
         {
            line = line.substring(1);
         }
      }
      else
      {
         for (line = readLine(); line != null; line = readLine())
         {
            if (line != null)
            {
               return read();
            }
         }
      }
      return ch;
   }

   /**
    * Process provided file and read characters from that file and display
    * those characters on standard output.
    *
    * @param arguments Command-line arguments; expect one argument which is the
    *    name of the file from which characters should be read.
    */
   public static void main(final String[] arguments) throws Exception
   {
      if (arguments.length > 0)
      {
        final long startTime = System.currentTimeMillis();
         out.println("Processing file '" + arguments[0] + "'...");
         final Main instance = new Main(arguments[0]);
         int characterInt = 0;
         int totalCharacters = 0;
         while (characterInt != -1)
         {
            characterInt = instance.read();
            totalCharacters++;
         }
         final long endTime = System.currentTimeMillis();
         out.println(
              "Elapsed Time of "
            + TimeUnit.MILLISECONDS.toSeconds(endTime - startTime)
            + " seconds for " + totalCharacters + " characters.");
      }
      else
      {
         out.println("ERROR: No file name provided.");
      }
   }
}

The runtime performance difference when comparing Java 6 to Java 7 or Java 8 is more pronounced as the lines get larger in terms of number of characters. The next screen snapshot demonstrates running the example in Java 6 (indicated by "jdk1.6" being part of path name of java launcher) and then in Java 8 (no explicit path provided because Java 8 is my default JRE) against a freshly generated file called dustin.txt that includes a line with 1 million (plus one) characters.

Although a Java 7 example is not shown in the screen snapshot above, my tests have shown that Java 7 has similar slowness to Java 8 in terms of processing very lone lines. Also, I have seen this in Windows and RedHat Linux JVMs. As the example indicates, the Java 6 version, even for a million characters in a line, reads the file in what rounds to 0 seconds. When the same compiled-for-Java-6 class file is executed with Java 8, the average length of time to handle the 1 million characters is over 150 seconds (2 1/2 minutes). This same slowness applies when the class is executed in Java 7 and also exists even when the class is compiled with JDK 7 or JDK 8.

Java 7 and Java 8 seem to be exponentially slower reading file characters as the number of characters on a line increases. When I raise the 1 million character line to 10 million characters as shown in the next screen snapshot, Java 6 still reads those very fast (still rounded to 0 seconds), but Java 8 requires over 5 hours to complete the task!

I don't know why Java 7 and Java 8 read a very long line from a file so much slower than Java 6 does. I hope that someone else can explain this. While I have several ideas for working around the issue, I would like to understand why Java 7 and Java 8 read lines with very large number of characters so much slower than Java 6. Here are the observations that can be made based on my testing:

The issue appears to be a runtime issue (JRE) rather than a JDK issue because even the file-reading class compiled with JDK 6 runs significantly slower in JRE 7 and JRE 8.
Both the Windows 8 and RedHat Linux JRE environments consistently indicated that the file reading is dramatically slower for very large lines in Java 7 and in Java 8 than in Java 6.
Processing time for reading very long lines appears to increase exponentially with the number of characters in the line in Java 7 and Java 8.

Saturday, December 27, 2014

Three Common Methods Generated in Three Java IDEs

In this post, I look at the differences in three "common" methods [equals(Object), hashCode(), and toString()] as generated by NetBeans 8.0.2, IntelliJ IDEA 14.0.2, and Eclipse Luna 4.4.1. The objective is not to determine which is best, but to show different approaches one can use for implementing these common methods. Along the way, some interesting insights can be picked up regarding creating of these common methods based on what the IDEs assume and prompt the developer to set.

NetBeans 8.0.2

NetBeans 8.0.2 allows the Project Properties to be configured to support the JDK 8 platform and to expect JDK 8 source formatting as shown in the next two screen snapshots.

Code is generated in NetBeans 8.0.2 by clicking on Source | Insert Code (or keystrokes Alt+Insert).

When generating the methods equals(Object), hashCode(), and toString(), NetBeans 8.0.2 asks for the attributes to be used in each of these generated methods as depicted in the next two screen snapshots.

The NetBeans-generated methods take advantage of the JDK 7-introduced Objects class.

NetBeans-Generated hashCode() Method for Class NetBeans802GeneratedCommonMethods.java

@Override
public int hashCode()
{
   int hash = 5;
   hash = 29 * hash + Objects.hashCode(this.someString);
   hash = 29 * hash + Objects.hashCode(this.timeUnit);
   hash = 29 * hash + this.integer;
   hash = 29 * hash + Objects.hashCode(this.longValue);
   return hash;
}

NetBeans-Generated equals(Object) Method for Class NetBeans802GeneratedCommonMethods.java

@Override
public boolean equals(Object obj)
{
   if (obj == null)
   {
      return false;
   }
   if (getClass() != obj.getClass())
   {
      return false;
   }
   final NetBeans802GeneratedCommonMethods other = (NetBeans802GeneratedCommonMethods) obj;
   if (!Objects.equals(this.someString, other.someString))
   {
      return false;
   }
   if (this.timeUnit != other.timeUnit)
   {
      return false;
   }
   if (this.integer != other.integer)
   {
      return false;
   }
   if (!Objects.equals(this.longValue, other.longValue))
   {
      return false;
   }
   return true;
}

NetBeans-Generated toString() Method for Class NetBeans802GeneratedCommonMethods.java

@Override
public String toString()
{
   return "NetBeans802GeneratedCommonMethods{" + "someString=" + someString + ", timeUnit=" + timeUnit + ", integer=" + integer + ", longValue=" + longValue + '}';
}

Some observations can be made regarding the NetBeans-generated common methods:

All generated code is automatic and does not support customization with the exception of the fields used in the methods which the operator selects.
All of these common methods that extend counterparts in the Object class automatically have the @Override annotation provided.
No Javadoc documentation is included for generated methods.
The methods make use of the Objects class to make the generated code more concise with less need for null checks.
Only one format is supported for the String generated by toString() and that output format is a single comma-delimited line.
I did not show it in the above example, but NetBeans 8.0.2's methods generation does treat arrays differently than references, enums, and primitives in some cases:
- The generated toString() method treats array attributes of the instance like it treats other instance attributes: it relies on the array's toString(), which leads to often undesirable and typically useless results (the array's system identity hash code). It'd generally be preferable to have the string contents of array attributes provided by Arrays.toString(Object[]) or equivalent overloaded version or Arrays.deepToString(Object[]).
- The generated hashCode() method uses Arrays.deepHashCode(Object[]) for handling arrays' hash codes.
- The generated equals(Object) method uses Arrays.deepEquals(Object[], Object[]) for handling arrays' equality checks.
- It is worth highlighting here that NetBeans uses the "deep" versions of the Arrays methods for comparing arrays for equality and computing arrays' hash codes while IntelliJ IDEA and Eclipse use the regular (not deep) versions of Arrays methods for comparing arrays for equality and computing arrays' hash codes.

IntelliJ IDEA 14.0.2

For these examples, I'm using IntelliJ IDEA 14.0.2 Community Edition.

IntelliJ IDEA 14.0.2 provides the ability to configure the Project Structure to expect a "Language Level" of JDK 8.

To generate code in IntelliJ IDEA 14.0.2, one uses the Code | Generate options (or keystrokes Alt+Insert like NetBeans).

IntelliJ IDEA 14.0.2 prompts the operator for which attributes should be included in the generated methods. It also asks which fields are non-null, meaning which fields are assumed to never be null. In the snapshot shown here, they are checked, which would lead to methods not checking those attributes for null before trying to access them. In the code that I generate with IntelliJ IDEA for this post, however, I won't have those checked, meaning that IntelliJ IDEA will check for null before accessing them in the generated methods.

IntelliJ IDEA 14.0.2's toString() generation provides a lengthy list of formats (templates) for the generated toString() method.

IntelliJ IDEA 14.0.2 also allows the operator to select the attributes to be included in the generated toString() method (selected when highlighted background is blue).

IDEA-Generated equals(Object) Method for Class Idea1402GeneratedCommonMethods.java

public boolean equals(Object o)
{
   if (this == o) return true;
   if (o == null || getClass() != o.getClass()) return false;

   Idea1402GeneratedCommonMethods that = (Idea1402GeneratedCommonMethods) o;

   if (integer != that.integer) return false;
   if (longValue != null ? !longValue.equals(that.longValue) : that.longValue != null) return false;
   if (someString != null ? !someString.equals(that.someString) : that.someString != null) return false;
   if (timeUnit != that.timeUnit) return false;

   return true;
}

IDEA-Generated hashCode() Method for Class Idea1402GeneratedCommonMethods.java

@Override
public int hashCode()
{
   int result = someString != null ? someString.hashCode() : 0;
   result = 31 * result + (timeUnit != null ? timeUnit.hashCode() : 0);
   result = 31 * result + integer;
   result = 31 * result + (longValue != null ? longValue.hashCode() : 0);
   return result;
}

IDEA-Generated toString() Method for Class Idea1402GeneratedCommonMethods.java

@Override
public String toString()
{
   return "Idea1402GeneratedCommonMethods{" +
      "someString='" + someString + '\'' +
      ", timeUnit=" + timeUnit +
      ", integer=" + integer +
      ", longValue=" + longValue +
      '}';
}

Some observations can be made regarding the IntelliJ IDEA-generated common methods:

Most generated code is automatic with minor available customization including the fields used in the methods which the operator selects, specification of which fields are expected to be non-null (so that null checks are not needed in generated code), and the ability to select one of eight built-in toString() formats.
All of these common methods that extend counterparts in the Object class automatically have the @Override annotation provided.
No Javadoc documentation is included for generated methods.
The generated methods do not make use of the Objects class and so require explicit checks for null for all references that could be null.
It's not shown in the above example, but IntelliJ IDEA 14.0.2 does treat arrays differently in the generation of these three common methods:
- Generated toString() method uses Arrays.toString(Array) on the array.
- Generated hashCode() method uses Arrays.hashCode(Object[]) (or overloaded version) on the array.
- Generated equals(Object) method uses Arrays.equals(Object[], Object[]) (or overloaded version) on the array.

Eclipse Luna 4.4.1

Eclipse Luna 4.4.1 allows the Java Compiler in Project Properties to be set to JDK 8.

In Eclipse Luna, the developer uses the "Source" drop-down to select the specific type of source code generation to be performed.

Eclipse Luna allows the operator to select the attributes to be included in the common methods. It also allows the operator to specify a few characteristics of the generated methods. For example, the operator can choose to have the elements of an array printed individually in the generated toString() method rather than an often meaningless class name and system identity hash code presented.

Eclipse-Generated hashCode() Method for Class Eclipse441GeneratedCommonMethods.java

/* (non-Javadoc)
 * @see java.lang.Object#hashCode()
 */
@Override
public int hashCode()
{
   final int prime = 31;
   int result = 1;
   result = prime * result + this.integer;
   result = prime * result
         + ((this.longValue == null) ? 0 : this.longValue.hashCode());
   result = prime * result
         + ((this.someString == null) ? 0 : this.someString.hashCode());
   result = prime * result
         + ((this.timeUnit == null) ? 0 : this.timeUnit.hashCode());    return result;
}

Eclipse-Generated equals(Object) Method for Class Eclipse441GeneratedCommonMethods.java

/* (non-Javadoc)
 * @see java.lang.Object#equals(java.lang.Object)
 */
@Override
public boolean equals(Object obj)
{
   if (this == obj)
      return true;
   if (obj == null)
      return false;
   if (getClass() != obj.getClass())
      return false;
   Eclipse441GeneratedCommonMethods other = (Eclipse441GeneratedCommonMethods) obj;
   if (this.integer != other.integer)
      return false;
   if (this.longValue == null)
   {
      if (other.longValue != null)
         return false;
   } else if (!this.longValue.equals(other.longValue))
     return false;
   if (this.someString == null)
   {
      if (other.someString != null)
         return false;
   } else if (!this.someString.equals(other.someString))
      return false;
   if (this.timeUnit != other.timeUnit)
      return false;
   return true;
}

Eclipse-Generated toString() Method for Class Eclipse441GeneratedCommonMethods.java

/* (non-Javadoc)
 * @see java.lang.Object#toString()
 */
@Override
public String toString()
{
   return "Eclipse441GeneratedCommonMethods [someString=" + this.someString
         + ", timeUnit=" + this.timeUnit + ", integer=" + this.integer
         + ", longValue=" + this.longValue + "]";
}

Some observations can be made regarding the Eclipse-generated common methods:

Eclipse provides the most points in the generation process in which the generated output can be configured. Here are some of the configurable options:
- Location in class (before or after existing methods of class) can be explicitly specified.
- All of these common methods that extend counterparts in the Object class automatically have the @Override annotation provided.
- "Method comments" can be generated, but they are not Javadoc style comments (use /* instead of /** and explicitly state they are not Javadoc comments as part of the generated comment).
- Option to "list contents of arrays instead of using native toString()" allows developer to have Arrays.toString(Array) be used (same as IntelliJ IDEA's approach and occurs if checked) or have the system identify hash code be used (same as NetBeans's approach and occurs if not checked).
- Support for four toString() styles plus ability to specify custom style.
- Ability to limit the number of entries of an array, collection, or map that is printed in toString().
- Ability to use instance of in generated equals(Object) implementation.
All of these common methods that extend counterparts in the Object class automatically have the @Override annotation provided.
The generated methods do not make use of the Objects class and so require explicit checks for null for all references that could be null.
Eclipse Luna 4.4.1 does treat arrays differently when generating the three common methods highlighted in this post:
- Generated toString() optionally uses Arrays.toString(Object[]) or overloaded version for accessing contents of array.
- Generated equals(Object) uses Arrays.equals(Object[], Object[]) or overloaded version for comparing arrays for equality.
- Generated hashCode() uses Arrays.hashCode(Object[]) or overloaded version for computing hash code of array.

Conclusion

All three IDEs covered in this post (NetBeans, IntelliJ IDEA, and Eclipse) generate sound implementations of the common methods equals(Object), hashCode(), and toString(), but there are differences between the customizability of these generated methods across the three IDEs. The different customizations that are available and the different implementations that are generated can provide lessons for developers new to Java to learn about and consider when implementing these methods. While the most obvious and significant advantage of these IDEs' ability to generate these methods is the time savings associated with this automatic generation, other advantages of IDE generation of these methods include the ability to learn about implementing these methods and the greater likelihood of successful implementations without typos or other errors.

Inspired by Actual Events

Dustin's Pages

Friday, August 17, 2018

Carefully Specify Multiple Resources in Single try-with-resources Statement

Saturday, February 24, 2018

Prefer System.lineSeparator() for Writing System-Dependent Line Separator Strings in Java

Monday, February 29, 2016

jcmd: One JDK Command-Line Tool to Rule Them All

Wednesday, February 18, 2015

Determining File Types in Java

Monday, January 26, 2015

Reading Large Lines Slower in JDK 7 and JDK 8

Saturday, December 27, 2014

Three Common Methods Generated in Three Java IDEs

My Non-Technical Blog

Software Development

Affiliations

Affiliations

Support Wikipedia