2022-05-14

Gremlin: Shortest logistical route between A and B while respecting schedules + other constraints

Preface

  • I'm new to Gremlin and working through Kelvin Lawrence's awesome eBook on the topic in order to solve a specific use-case.
  • Due to the sheer amount to learn, I'm asking this question to get recommendations on how I might approach the challenge so that, as I read the eBook, I'll better know the sections to which to pay extra attention.
  • I intend to use AWS Neptune in the pursuit of solving this, so I tagged that topic as well.

Question

Respecting departure/arrival times of legs + other constraints, can the shortest path (the real-world, logistical meaning of "path") between origin and destination be "queried" (i.e., can I use the Gremlin console with a single statement)? Or is the use-case of such complexity that I will effectively need to write a program to accomplish it?


Use-Case / Detail

I hope to answer the question:

Starting at ORIGIN on DAY, can I get to DESTINATION while respecting [CONDITIONS]?

The good news is that I only need a true/false response (so limit(1)?) and a lack of a result (e.g., []) suffices for "no".

What are the conditions?

  • Flight schedules need to be respected. Instead of simple flight routes (i.e., a connection exists between BOSton and DALlas), I have actual flight schedules (i.e., on Wednesday, 9 Nov 2022 at 08:40, flight XYZ will depart BOSton and then arrive DALlas at 13:15) ... consequently, if/when there are connections, I need to respect arrival and departure times + some sort of buffer (i.e., a path for which a Traveler would arrive at 13:05 and depart on another leg at 13:06 isn't actually a valid path);
  • Aggregate travel time / cost limits. The answer to the question needs to be "No" if a path's aggregate travel time or aggregate cost exceeds specified limits. (Here, I believe I'll need to use sack() to track the cost - financial and time - of each leg and bail out of the repeat() until loop when either is hit?)

I apologize b/c I know this isn't a good StackOverflow question, since it's not technically specific -- my hope is that, at least, some specific technical recommendations might result.

The use-case seems like the varsity / pro version of the flight routes example presented in the eBook, which is perfect for someone brand-new to Gremlin ... 😅


Updates / Follow-up Questions

UPDATE 1

Thank you very much, @Kelvin Lawrence, for responding A few questions as I think the data model through:

  1. Do you happen to know what is meant by ORD_FLIGHT in the edges between AirportDay and Flight vertices in the second blog post?
  2. Since the aggregate travel time (i.e., flight duration(s) + layover(s)) is an important, limiting condition, would it be more efficient to add Duration as a property of a Flight (i.e., calculate it during data ingestion)? Or is it better to use the Departs and Arrives properties, at query-time, to calculate Duration?
  3. Along the lines of Q2, what are your thoughts on adding the best-case (i.e., fastest) duration to the "first" graph (i.e., Flight Routes)? Thinking: could apply a "naive travel time limit" to the first traversal, disqualifying candidate DESTINATIONs that are clearly unreachable time-wise -- winnowing down the set of paths for the second traversal.

UPDATE 2

  1. Here's my first attempt at the data model (notes continue after the break):

Stab at data model

  1. First, I've generalized Airports ==> Ports and Flights ==> Voyages, because my use-case could grow to encompass more than flights (i.e., rail and sea transport could be a thing).
  2. Second (and more importantly), reflecting on maxdemarzi's blog post, I believe his model creates a challenge that is never addressed. Specifically, his proposed model effectively changes the motivating question from:

I want to depart Airport A on Day D and arrive Airport B -- what are the best routing options?

to:

I want to depart on AirportDay A -- what are the best routing options to AirportDay B?

The issue with this approach is that it's invalid to assume the Day component of the AirportDay vertex identifier will be the same -- doing so will constrain the search to same-day paths.

To illustrate: How would that model discover a path that includes the gain (or loss) of a day due to crossing the International Date Line -or- "crossing midnight" (e.g., depart Boston 5 Jan @ 2030, arrive Barcelona 6 Jan @ 0900)?

  1. I'm not sure my solution is implementable (so please validate, as this is very important), but I propose a dayChange property on the Voyage node. This will usually be 0, occasionally -1 or +1, and (for multi-day duration Voyages) could be up to +N. My thinking is that this could be used during traversal to dynamically determine "target" vertices:

Example: I want to depart BOS on 5-Jan and get to BCN -- what are the best routing options?

  • Start traversal at AirportDay node id: BOS-5JAN
  • Get outbound edges
  • One Voyage node has dayChange == +1
  • Determine if the Voyage is connected to BCN-6JAN (BCN == known destination, 6JAN is calculated by adding the +1 to the known start date)
  • If BCN-6JAN has an inbound edge, then one path has been found

The ability to dynamically build the target vertex (BCN + 6JAN) is the crux of the proposed solution.



No comments:

Post a Comment