From time to time, I’ve wanted to import a simple text file into LibreOffice Writer or other word processing programs. The challenge when doing so is that most text files have a newline at the end of every line, not just at the end of each paragraph, and word processing programs interpret every newline as a paragraph marker.

The usual technique is to either fix up the import by hand, which is OK for small files, or to go through a tedious sequence of edits to remove all newlines. This looks something like:

  • search for two consecutive newlines and replace them with the word NEWPARA
  • search for and remove all remaining newlines
  • search for NEWPARA and replace it with a newline

That almost works, but you’ll quickly realise that step 2 should be “search for all remaining newlines and replace with a space”. I also often find that the resulting file has some instances of two consecutive spaces.

In real life

Right now, I’m working on a book and the challenge I faced recently was importing 34 text files into one word processing document. The prospect of manually fixing up all 34 files was daunting. This is Linux: there must be an easier way!

I looked at sed, the stream editor. That’s great for making the same changes throughout a file or series of files, which is exactly what I want.

The first challenge is that, by default, sed operates on the input file one line at a time. Changing ‘dog’ to ‘cat’ would be easy:

$ sed --in-place -e 's/dog/cat/g' myfile.txt

But the first replacement I want to do, changing two consecutive newlines to NEWPARA, by definition requires sed to operate on more than one line at a time.
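To see the limitation concretely, here is a small sketch that tries the multi-line substitution in sed's default line-at-a-time mode. The sample text and the variable names are made up for illustration, and GNU sed is assumed:

```shell
# Build a small sample: two paragraphs separated by a blank line.
sample=$(printf 'first line\nsecond line\n\nnew paragraph\n')

# sed removes the trailing newline from each line before running the
# script, so the pattern space never contains \n\n and nothing matches.
result=$(printf '%s\n' "$sample" | sed 's/\n\n/NEWPARA/g')

printf '%s\n' "$result"
```

The output is identical to the input: no NEWPARA marker appears, because sed never sees two newlines together.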

The solution is to have sed read the whole file before starting. This command will carry out the first step of our process, that of replacing two consecutive newlines with the word NEWPARA:

$ sed ':a;N;$!ba;s/\n\n/NEWPARA/g' <myfile.txt >output.txt

How it works

  • :a Create a label called ‘a’
  • ; The semicolon is the sed multi-statement separator
  • N Read the next line of input and append it to the pattern space, separated by a newline
  • ; Separator
  • $!ba Branch to label a unless we are on the last line of input
  • ; Separator
  • s/\n\n/NEWPARA/g Replace all instances of two consecutive newline characters with NEWPARA (the g flag makes the substitution global; without it, only the first match is replaced)
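As a rough illustration of the steps above, the sketch below uses sed's l command to print the pattern space unambiguously once the loop has finished, then applies the substitution with the g flag. The sample text and variable names are hypothetical, and GNU sed is assumed:

```shell
# With -n, sed prints only what the l command emits. l shows each
# embedded newline as \n and marks the end of the pattern space with $.
visible=$(printf 'first line\nsecond line\n\nnew paragraph\n' |
    sed -n ':a;N;$!ba;l')
printf '%s\n' "$visible"
# prints: first line\nsecond line\n\nnew paragraph$

# Now that the whole file is in the pattern space, \n\n can match.
marked=$(printf 'first line\nsecond line\n\nnew paragraph\n' |
    sed ':a;N;$!ba;s/\n\n/NEWPARA/g')
printf '%s\n' "$marked"
```

The second command prints two lines: "first line" and "second lineNEWPARAnew paragraph". The single newline inside the first paragraph is still there at this stage.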

The final version

I appended three more substitutions to the sed script:

  • s/\n/ /g – replace each remaining newline with a space
  • s/NEWPARA/\n/g – replace NEWPARA with a newline
  • s/  */ /g – replace each run of one or more spaces with a single space (in sed’s default basic regular expressions a bare + is a literal plus sign, so the pattern is written as two spaces followed by *)

Split over multiple lines for readability, the final command was:

$ sed \
':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/  */ /g' \
<myfile.txt >output.txt

The resulting file can be read into a word processor and will have the correct paragraph breaks and no sequences of multiple spaces.
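For a batch of files, the same script can be wrapped in a shell loop. The following is a minimal sketch rather than the exact procedure used for the book: the scratch directory, the chapter file names, and the .unwrapped suffix are all made up for illustration, and GNU sed is assumed. The space-squeezing pattern is written as two spaces followed by *, which matches one or more spaces in sed's default basic regular expressions.

```shell
# Work in a scratch directory so the sketch does not touch real files.
workdir=$(mktemp -d)
printf 'The quick brown\nfox jumps.\n\nIt  has a second\nparagraph.\n' \
    >"$workdir/chapter01.txt"
printf 'Another\nfile.\n' >"$workdir/chapter02.txt"

# Unwrap every .txt file into a matching .unwrapped file,
# leaving the originals untouched.
for f in "$workdir"/*.txt; do
    sed ':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/  */ /g' \
        <"$f" >"${f%.txt}.unwrapped"
done

cat "$workdir/chapter01.unwrapped"
```

The last command prints two lines, one per paragraph: "The quick brown fox jumps." and "It has a second paragraph." Writing to a new file instead of using --in-place keeps the originals available in case the result needs checking.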
