Skip to content

Commit

Permalink
Return null for invalid datetime format when legacy date formatter is…
Browse files Browse the repository at this point in the history
… used (#11131)

Summary:
For Spark functions 'unix_timestamp', 'from_unixtime' and 'get_timestamp',
returns null for invalid datetime format when legacy date formatter is used.

Relating to #10354.

Pull Request resolved: #11131

Reviewed By: xiaoxmeng

Differential Revision: D65517767

Pulled By: mbasmanova

fbshipit-source-id: 863d3955caa64317c306458ff917e13e2593bc8f
  • Loading branch information
NEUpanning authored and facebook-github-bot committed Nov 7, 2024
1 parent d1bf9da commit 64661a3
Show file tree
Hide file tree
Showing 3 changed files with 104 additions and 38 deletions.
22 changes: 13 additions & 9 deletions velox/docs/functions/spark/datetime.rst
Original file line number Diff line number Diff line change
Expand Up @@ -82,14 +82,14 @@ These functions support TIMESTAMP and DATE input types.
Adjusts ``unixTime`` (elapsed seconds since UNIX epoch) to configured session timezone, then
converts it to a formatted time string according to ``format``. Only supports BIGINT type for
``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`
``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>` date formatter depends on ``spark.legacy_date_formatter`` configuration.
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
`Valid patterns for date format
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. Throws exception for
invalid ``format``. This function will convert input to milliseconds, and integer overflow is
allowed in the conversion, which aligns with Spark. See the below third example where INT64_MAX
is used, -1000 milliseconds are produced by INT64_MAX * 1000 due to integer overflow. ::
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. When `Simple` date formatter is used,
null is returned for invalid ``format``; otherwise, exception is thrown. This function will convert input to
milliseconds, and integer overflow is allowed in the conversion, which aligns with Spark. See the below third
example where INT64_MAX is used, -1000 milliseconds are produced by INT64_MAX * 1000 due to integer overflow. ::

SELECT from_unixtime(100, 'yyyy-MM-dd HH:mm:ss'); -- '1970-01-01 00:01:40'
SELECT from_unixtime(3600, 'yyyy'); -- '1970'
Expand All @@ -113,7 +113,11 @@ These functions support TIMESTAMP and DATE input types.
The format follows Spark's
`Datetime patterns
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.
Returns NULL for parsing error or NULL input. Throws exception for invalid format. ::
Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
Returns NULL for parsing error or NULL input. When `Simple` date formatter is used, null is returned for invalid
``dateFormat``; otherwise, exception is thrown. ::

SELECT get_timestamp('1970-01-01', 'yyyy-MM-dd); -- timestamp `1970-01-01`
SELECT get_timestamp('1970-01-01', 'yyyy-MM'); -- NULL (parsing error)
Expand Down Expand Up @@ -288,8 +292,8 @@ These functions support TIMESTAMP and DATE input types.
.. spark:function:: unix_timestamp() -> integer
Returns the current UNIX timestamp in seconds. Using
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>` date formatter in lenient mode
that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>` date formatter
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter in lenient mode
that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>`_ date formatter
depends on the ``spark.legacy_date_formatter`` configuration.

.. spark:function:: unix_timestamp(string) -> integer
Expand Down
95 changes: 66 additions & 29 deletions velox/functions/sparksql/DateTimeFunctions.h
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,32 @@ Expected<std::shared_ptr<DateTimeFormatter>> getDateTimeFormatter(
std::string_view(format.data(), format.size()));
}
}

// Creates datetime formatter based on the provided format string. When legacy
// formatter is used, returns nullptr for invalid format; otherwise, throws a
// user error.
//
// @param format The format string to be used for initializing the formatter.
// @param legacyFormatter Whether legacy formatter is used.
// @return A shared pointer to a DateTimeFormatter. If the formatter
// initialization fails and the legacy formatter is used, nullptr is returned.
inline std::shared_ptr<DateTimeFormatter> initializeFormatter(
const std::string_view format,
bool legacyFormatter) {
auto formatter = detail::getDateTimeFormatter(
std::string_view(format),
legacyFormatter ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
// When legacy formatter is used, returns nullptr for invalid format;
// otherwise, throws a user error.
if (formatter.hasError()) {
if (legacyFormatter) {
return nullptr;
}
VELOX_USER_FAIL(formatter.error().message());
}
return formatter.value();
}
} // namespace detail

template <typename T>
Expand Down Expand Up @@ -245,43 +271,51 @@ struct FromUnixtimeFunction {
legacyFormatter_ = config.sparkLegacyDateFormatter();
sessionTimeZone_ = getTimeZoneFromConfig(config);
if (format != nullptr) {
setFormatter(*format);
auto formatter = detail::initializeFormatter(
std::string_view(*format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
} else {
invalidFormat_ = true;
}
isConstantTimeFormat_ = true;
}
}

FOLLY_ALWAYS_INLINE void call(
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<int64_t>& second,
const arg_type<Varchar>& format) {
if (invalidFormat_) {
return false;
}
if (!isConstantTimeFormat_) {
setFormatter(format);
auto formatter = detail::initializeFormatter(
std::string_view(format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
} else {
return false;
}
}
const Timestamp timestamp{second, 0};
result.reserve(maxResultSize_);
int32_t resultSize;
resultSize = formatter_->format(
timestamp, sessionTimeZone_, maxResultSize_, result.data(), true);
result.resize(resultSize);
return true;
}

private:
FOLLY_ALWAYS_INLINE void setFormatter(const arg_type<Varchar>& format) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(format.data(), format.size()),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
}

const tz::TimeZone* sessionTimeZone_{nullptr};
std::shared_ptr<DateTimeFormatter> formatter_;
uint32_t maxResultSize_;
bool isConstantTimeFormat_{false};
bool legacyFormatter_{false};
bool invalidFormat_{false};
};

template <typename T>
Expand Down Expand Up @@ -361,29 +395,31 @@ struct GetTimestampFunction {
sessionTimeZone_ = tz::locateZone(sessionTimezoneName);
}
if (format != nullptr) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(*format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
isConstantTimeFormat_ = true;
auto formatter = detail::initializeFormatter(
std::string_view(*format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
} else {
invalidFormat_ = true;
}
}
}

FOLLY_ALWAYS_INLINE bool call(
out_type<Timestamp>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& format) {
if (invalidFormat_) {
return false;
}
if (!isConstantTimeFormat_) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
auto formatter = detail::initializeFormatter(
std::string_view(format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
} else {
return false;
}
}
auto dateTimeResult = formatter_->parse(std::string_view(input));
// Null as result for parsing error.
Expand All @@ -406,6 +442,7 @@ struct GetTimestampFunction {
bool isConstantTimeFormat_{false};
const tz::TimeZone* sessionTimeZone_{tz::locateZone(0)}; // default to GMT.
bool legacyFormatter_{false};
bool invalidFormat_{false};
};

template <typename T>
Expand Down
25 changes: 25 additions & 0 deletions velox/functions/sparksql/tests/DateTimeFunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,12 @@ class DateTimeFunctionsTest : public SparkFunctionBaseTest {
});
}

void enableLegacyFormatter() {
queryCtx_->testingOverrideConfigUnsafe({
{core::QueryConfig::kSparkLegacyDateFormatter, "true"},
});
}

template <typename TOutput, typename TValue>
std::optional<TOutput> evaluateDateFuncOnce(
const std::string& expr,
Expand Down Expand Up @@ -262,6 +268,9 @@ TEST_F(DateTimeFunctionsTest, unixTimestamp) {
EXPECT_EQ(std::nullopt, unixTimestamp("00:00:00"));
EXPECT_EQ(std::nullopt, unixTimestamp(""));
EXPECT_EQ(std::nullopt, unixTimestamp("malformed input"));
enableLegacyFormatter();
EXPECT_EQ(std::nullopt, unixTimestamp(""));
EXPECT_EQ(std::nullopt, unixTimestamp("malformed input"));
}

TEST_F(DateTimeFunctionsTest, unixTimestampCurrent) {
Expand Down Expand Up @@ -803,6 +812,14 @@ TEST_F(DateTimeFunctionsTest, getTimestamp) {
VELOX_ASSERT_THROW(
getTimestamp("2023-07-13 21:34", "yyyy-MM-dd HH:II"),
"Specifier I is not supported");

// Returns null for invalid datetime format when legacy date formatter is
// used.
enableLegacyFormatter();
// Empty format.
EXPECT_EQ(getTimestamp("0", ""), std::nullopt);
// Unsupported specifier.
EXPECT_EQ(getTimestamp("0", "l"), std::nullopt);
}

TEST_F(DateTimeFunctionsTest, hour) {
Expand Down Expand Up @@ -926,6 +943,14 @@ TEST_F(DateTimeFunctionsTest, fromUnixtime) {
fromUnixTime(0, "FF/MM/dd"), "Specifier F is not supported");
VELOX_ASSERT_THROW(
fromUnixTime(0, "yyyy-MM-dd HH:II"), "Specifier I is not supported");

// Returns null for invalid datetime format when legacy date formatter is
// used.
enableLegacyFormatter();
// Empty format.
EXPECT_EQ(fromUnixTime(0, ""), std::nullopt);
// Unsupported specifier.
EXPECT_EQ(fromUnixTime(0, "l"), std::nullopt);
}

TEST_F(DateTimeFunctionsTest, makeYMInterval) {
Expand Down

0 comments on commit 64661a3

Please sign in to comment.