Split String in Python While Keeping “Delimiter”

Have you ever tried to split a string with a delimiter, only to realize that you actually needed the delimiter as part of the content?  I have run into this several times, so I want to jot down one way to solve the issue in the hope that it helps.

The Problem

I was recently working on a task to import data from a text file.  Each file had one or more data sets, with each data set having identical formats.  The following simple text file replicates the same issues that I was bumping into.
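As a stand-in for that file, here is a hypothetical sample held in a Python string.  The "# Person N" header line and the Name/Birthday fields are assumptions made for illustration, not the real import format, and the values are placeholders.

```python
# Hypothetical sample data: the header format and field names are assumed.
sample_text = """\
# Person 1
Name: Frodo Baggins
Birthday: September 22
# Person 2
Name: Samwise Gamgee
Birthday: April 6
# Person 3
Name: Peregrin Took
Birthday: unknown
"""

# Each data set starts with its own header line.
header_count = sample_text.count("# Person ")
print(header_count)  # 3
```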

The first thing I wanted to do was to identify the substring of text that corresponded to each chunk, or Person in this case.  Although this isn’t a crazy hard thing to do, it turned out to be a bit sneakier than I had originally thought.  (Side note: I just realized that Frodo Baggins shares the same birthday as my sister-in-law!  I doubt she will be as excited about it as I am.)

The Final Solution

Cutting to the chase, here is the snippet of code that I came up with to solve this problem along with the output.  I will then get into how I came up with it and why it works.
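A minimal sketch of that kind of function, under some assumptions: the name split_keep_header is hypothetical, the sample text is invented, and the header pattern passed in is assumed to contain no capturing groups (re.findall would return the groups instead of the whole match).

```python
import re

def split_keep_header(text, header_pattern):
    """Return one substring per chunk, each starting with its header."""
    # Header, then as little text as possible (non-greedy), stopping just
    # before the next header or at the end of the string.
    # re.DOTALL lets "." cross line boundaries.
    chunk_pattern = rf"(?:{header_pattern}).*?(?=(?:{header_pattern})|$)"
    return re.findall(chunk_pattern, text, flags=re.DOTALL)

# Hypothetical sample data, not the original file.
text = (
    "# Person 1\nName: Frodo Baggins\nBirthday: September 22\n"
    "# Person 2\nName: Samwise Gamgee\nBirthday: April 6\n"
    "# Person 3\nName: Peregrin Took\nBirthday: unknown\n"
)

chunks = split_keep_header(text, r"# Person \d+")
for chunk in chunks:
    print(repr(chunk))
```

With this sample, the call returns three chunks, and each one still begins with its own "# Person N" header.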

You may need to adapt this for your specific scenario, but this general approach should work if you have the same need.

How it Works

Using the basic str.split method doesn’t work because it gets rid of the delimiter, whereas in this case we want to keep the “delimiter”.  I put the term “delimiter” in quotes because str.split doesn’t apply to this problem for a simple reason: there actually is no delimiter.  Rather, there are consistent patterns repeated multiple times, each starting with some kind of header.

Once you realize that this is more about identifying patterns than using a delimiter, you can shift your focus to the regular expressions module, re, instead.  It turns out that the re.findall function is just the thing for this case, so long as you know how to describe the regex pattern in a robust way.  Sometimes this is a bit tricky since you need to make sure that the pattern holds regardless of the content in each data set.

First Try: Header Slurp

It is pretty simple to start off knowing that we want to find a pattern something like # Person \d+.*, but unfortunately that doesn’t work because it doesn’t know how much to slurp.  With the re.DOTALL flag set so that . also matches newlines, this greedy version ends up taking the entire string, since .* consumes everything after the first header.  I was hoping that turning this into the non-greedy # Person \d+.*? would fix it, but .*? is satisfied with matching nothing at all, so the matches stop just after the header.
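To see both behaviours side by side, here is a small sketch.  The sample text is hypothetical, and the use of re.DOTALL is an assumption; without it, . stops at the end of each line.

```python
import re

# Hypothetical sample data, not the original file.
text = (
    "# Person 1\nName: Frodo Baggins\nBirthday: September 22\n"
    "# Person 2\nName: Samwise Gamgee\nBirthday: April 6\n"
    "# Person 3\nName: Peregrin Took\nBirthday: unknown\n"
)

# Greedy: ".*" swallows everything after the first header.
greedy = re.findall(r"# Person \d+.*", text, flags=re.DOTALL)
print(len(greedy))  # 1

# Non-greedy: ".*?" matches as little as possible, which is nothing.
lazy = re.findall(r"# Person \d+.*?", text, flags=re.DOTALL)
print(lazy)  # ['# Person 1', '# Person 2', '# Person 3']
```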

Second Try: Positive Look-Ahead

To make this more useful, we need to tell the regular expression where to stop for each pattern.  Since we have headers, we know that it should go until the next header, but we don’t want more than one header in each chunk.  The trick here is to use a positive look-ahead assertion, which checks that a pattern comes next without consuming it.  Combined with the non-greedy .*?, it basically means “slurp until just before this pattern”.  To do this in Python, you use the (?=...) construct.  Knowing this, we can update our pattern to include a positive look-ahead for the following header.

With this pattern, every chunk is found except the last one.  So what happened to our last chunk?
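A short sketch of the look-ahead version, again on hypothetical sample text and assuming re.DOTALL:

```python
import re

# Hypothetical sample data, not the original file.
text = (
    "# Person 1\nName: Frodo Baggins\nBirthday: September 22\n"
    "# Person 2\nName: Samwise Gamgee\nBirthday: April 6\n"
    "# Person 3\nName: Peregrin Took\nBirthday: unknown\n"
)

# Each match must be followed by another header, which is not consumed.
pattern = r"# Person \d+.*?(?=# Person \d+)"
chunks = re.findall(pattern, text, flags=re.DOTALL)

print(len(chunks))  # 2
print([c.splitlines()[0] for c in chunks])  # ['# Person 1', '# Person 2']
```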

Final Try: Success!

For a match we are requiring that each chunk be followed by a header, which works for all chunks except the last one.  After the last chunk comes the end of the file, so we need to let the expression know that the pattern can be followed either by another header OR by the end of the string.  To do this, we can add a | (or) with $ (end-of-string) inside the positive look-ahead assertion, turning it into (?=# Person \d+|$).
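Here is a sketch of that change, using the same hypothetical sample text and the assumed re.DOTALL flag:

```python
import re

# Hypothetical sample data, not the original file.
text = (
    "# Person 1\nName: Frodo Baggins\nBirthday: September 22\n"
    "# Person 2\nName: Samwise Gamgee\nBirthday: April 6\n"
    "# Person 3\nName: Peregrin Took\nBirthday: unknown\n"
)

# The look-ahead now accepts either the next header or the end of the string.
pattern = r"# Person \d+.*?(?=# Person \d+|$)"
chunks = re.findall(pattern, text, flags=re.DOTALL)

print(len(chunks))  # 3
print([c.splitlines()[0] for c in chunks])
# ['# Person 1', '# Person 2', '# Person 3']
```

One detail to be aware of: in Python, $ also matches just before a trailing newline at the end of the string, so the last chunk here comes back without its final newline.  Using \Z instead of $ would match only at the very end.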

Whoo hoo!  That did it.  To clean it up and remove redundancy, I just added the header_pattern as an argument to the function and went on my way to do some fun parsing on each chunk.

In Closing

There are always many ways to solve a problem like this, and I am sure there are several in this case.  One other approach that comes to mind is to run re.findall once for each field that we are looking for and then later correlate the results and stitch them back together.  This approach would obviate the need to separate based on chunks, but it would also add complexity if the chunks are not identically formatted.
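A rough sketch of that per-field idea, assuming hypothetical "Name:" and "Birthday:" fields that each appear exactly once per chunk:

```python
import re

# Hypothetical sample data, not the original file.
text = (
    "# Person 1\nName: Frodo Baggins\nBirthday: September 22\n"
    "# Person 2\nName: Samwise Gamgee\nBirthday: April 6\n"
    "# Person 3\nName: Peregrin Took\nBirthday: unknown\n"
)

# re.MULTILINE makes ^ and $ match at the start and end of every line.
names = re.findall(r"^Name: (.*)$", text, flags=re.MULTILINE)
birthdays = re.findall(r"^Birthday: (.*)$", text, flags=re.MULTILINE)

# Stitch the fields back together by position.
people = [{"name": n, "birthday": b} for n, b in zip(names, birthdays)]
print(people[0])  # {'name': 'Frodo Baggins', 'birthday': 'September 22'}
```

The weak spot is the zip step: if one chunk is missing a field, every later record silently pairs values from different people, which is the added complexity mentioned above.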

I hope this post was helpful to you in solving your problem and please let me know if you have any comments/questions!
