Class LevenshteinDistance

java.lang.Object
org.apache.commons.text.similarity.LevenshteinDistance
All Implemented Interfaces:
BiFunction<CharSequence, CharSequence, Integer>, EditDistance<Integer>, ObjectSimilarityScore<CharSequence, Integer>, SimilarityScore<Integer>

public class LevenshteinDistance extends Object implements EditDistance<Integer>
An algorithm for measuring the difference between two character sequences using the Levenshtein Distance.

This is the number of changes needed to change one sequence into another, where each change is a single character modification (deletion, insertion or substitution).

This code has been adapted from Apache Commons Lang 3.3.

Since:
1.0
See Also:
  • Field Details

    • INSTANCE

      private static final LevenshteinDistance INSTANCE
      The singleton instance.
    • threshold

      private final Integer threshold
      Threshold.
  • Constructor Details

    • LevenshteinDistance

      @Deprecated public LevenshteinDistance()
      Deprecated.
      Constructs a default instance that uses a version of the algorithm that does not use a threshold parameter.
      See Also:
    • LevenshteinDistance

      public LevenshteinDistance(Integer threshold)
      Constructs a new instance. If the threshold is not null, distance calculations will be limited to a maximum length. If the threshold is null, the unlimited version of the algorithm will be used.
      Parameters:
      threshold - If this is null then distances calculations will not be limited. This may not be negative.
  • Method Details

    • getDefaultInstance

      public static LevenshteinDistance getDefaultInstance()
      Gets the default instance.
      Returns:
      The default instance.
    • limitedCompare

      private static <E> int limitedCompare(SimilarityInput<E> left, SimilarityInput<E> right, int threshold)
      Finds the Levenshtein distance between two CharSequences if it's less than or equal to a given threshold.

      This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield and Chas Emerick's implementation of the Levenshtein distance algorithm from http://www.merriampark.com/ld.htm

      limitedCompare(null, *, *)             = Throws IllegalArgumentException
      limitedCompare(*, null, *)             = Throws IllegalArgumentException
      limitedCompare(*, *, -1)               = Throws IllegalArgumentException
      limitedCompare("","", 0)               = 0
      limitedCompare("aaapppp", "", 8)       = 7
      limitedCompare("aaapppp", "", 7)       = 7
      limitedCompare("aaapppp", "", 6))      = -1
      limitedCompare("elephant", "hippo", 7) = 7
      limitedCompare("elephant", "hippo", 6) = -1
      limitedCompare("hippo", "elephant", 7) = 7
      limitedCompare("hippo", "elephant", 6) = -1
      
      Parameters:
      left - the first SimilarityInput, must not be null.
      right - the second SimilarityInput, must not be null.
      threshold - the target threshold, must not be negative.
      Returns:
      result distance, or -1
    • unlimitedCompare

      private static <E> int unlimitedCompare(SimilarityInput<E> left, SimilarityInput<E> right)
      Finds the Levenshtein distance between two Strings.

      A higher score indicates a greater distance.

      The previous implementation of the Levenshtein distance algorithm was from https://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm

      This implementation only need one single-dimensional arrays of length s.length() + 1

      unlimitedCompare(null, *)             = Throws IllegalArgumentException
      unlimitedCompare(*, null)             = Throws IllegalArgumentException
      unlimitedCompare("","")               = 0
      unlimitedCompare("","a")              = 1
      unlimitedCompare("aaapppp", "")       = 7
      unlimitedCompare("frog", "fog")       = 1
      unlimitedCompare("fly", "ant")        = 3
      unlimitedCompare("elephant", "hippo") = 7
      unlimitedCompare("hippo", "elephant") = 7
      unlimitedCompare("hippo", "zzzzzzzz") = 8
      unlimitedCompare("hello", "hallo")    = 1
      
      Parameters:
      left - the first CharSequence, must not be null.
      right - the second CharSequence, must not be null.
      Returns:
      result distance, or -1.
      Throws:
      IllegalArgumentException - if either CharSequence input is null.
    • apply

      public Integer apply(CharSequence left, CharSequence right)
      Computes the Levenshtein distance between two Strings.

      A higher score indicates a greater distance.

      The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

      Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings.
      This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm

      distance.apply(null, *)             = Throws IllegalArgumentException
      distance.apply(*, null)             = Throws IllegalArgumentException
      distance.apply("","")               = 0
      distance.apply("","a")              = 1
      distance.apply("aaapppp", "")       = 7
      distance.apply("frog", "fog")       = 1
      distance.apply("fly", "ant")        = 3
      distance.apply("elephant", "hippo") = 7
      distance.apply("hippo", "elephant") = 7
      distance.apply("hippo", "zzzzzzzz") = 8
      distance.apply("hello", "hallo")    = 1
      
      Specified by:
      apply in interface BiFunction<CharSequence, CharSequence, Integer>
      Specified by:
      apply in interface ObjectSimilarityScore<CharSequence, Integer>
      Specified by:
      apply in interface SimilarityScore<Integer>
      Parameters:
      left - the first input, must not be null.
      right - the second input, must not be null.
      Returns:
      result distance, or -1.
      Throws:
      IllegalArgumentException - if either String input null.
    • apply

      public <E> Integer apply(SimilarityInput<E> left, SimilarityInput<E> right)
      Computes the Levenshtein distance between two inputs.

      A higher score indicates a greater distance.

      distance.apply(null, *)             = Throws IllegalArgumentException
      distance.apply(*, null)             = Throws IllegalArgumentException
      distance.apply("","")               = 0
      distance.apply("","a")              = 1
      distance.apply("aaapppp", "")       = 7
      distance.apply("frog", "fog")       = 1
      distance.apply("fly", "ant")        = 3
      distance.apply("elephant", "hippo") = 7
      distance.apply("hippo", "elephant") = 7
      distance.apply("hippo", "zzzzzzzz") = 8
      distance.apply("hello", "hallo")    = 1
      
      Type Parameters:
      E - The type of similarity score unit.
      Parameters:
      left - the first input, must not be null.
      right - the second input, must not be null.
      Returns:
      result distance, or -1.
      Throws:
      IllegalArgumentException - if either String input null.
      Since:
      1.13.0
    • getThreshold

      public Integer getThreshold()
      Gets the distance threshold.
      Returns:
      The distance threshold.