From time to time, I’ve wanted to import a simple text file into LibreOffice Writer or other word processing programs. The challenge when doing so is that most text files have embedded newlines in them, and word processing programs interpret newlines as paragraph markers.
The usual technique is to either fix up the import by hand, which is OK for small files, or to go through a tedious sequence of edits to remove all newlines. This looks something like:
- search for two consecutive newlines and replace them with the word NEWPARA
- search for and remove all remaining newlines
- search for NEWPARA and replace it with a newline
That almost works, but you’ll quickly realise that step 2 should be “search for all remaining newlines and replace with a space”. I also often find that the resulting file has some instances of two consecutive spaces.
In real life
Right now, I’m working on a book and the challenge I faced recently was importing 34 text files into one word processing document. The prospect of manually fixing up all 34 files was daunting. This is Linux: there must be an easier way!
I looked at sed, the stream editor. That’s great for making the same changes throughout a file or series of files, which is exactly what I want.
The first challenge is that, by default, sed operates on the input file one line at a time. Changing ‘dog’ to ‘cat’ would be easy:
$ sed --in-place -e 's/dog/cat/g' myfile.txt
But the first replacement I want to do, changing two consecutive newlines to NEWPARA, by definition requires sed to operate on more than one line at a time.
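To see the problem in action, here is a small sketch. The sample text and the variable names (sample, result) are made up for illustration. Because sed hands the script one line at a time, the pattern space never contains a newline, so the substitution should find nothing to replace:

```shell
# Hypothetical sample: two paragraphs separated by one blank line.
sample=$(printf 'first para\n\nsecond para\n')

# Line-at-a-time processing: no line ever contains "\n\n",
# so the text passes through unchanged.
result=$(printf '%s\n' "$sample" | sed 's/\n\n/NEWPARA/g')

printf '%s\n' "$result"
```

The output still has its blank line and no NEWPARA marker, which is why a different approach is needed.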
The solution is to have sed read all the file before starting. This command will carry out the first step of our process, that of replacing two consecutive newlines with the word NEWPARA:
$ sed ':a;N;$!ba;s/\n\n/NEWPARA/g' <myfile.txt >output.txt
How it works
:a – Create a label called ‘a’
; – The semicolon is the sed multi-statement separator
N – Read the next line of input and append it to the pattern space
$!ba – Branch to label a unless the last line of the file has been read
s/\n\n/NEWPARA/g – Replace all instances of two consecutive newline characters with NEWPARA
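One way to check that the loop really does gather the whole file is to print the pattern space with sed’s l command, which shows newlines as \n and marks the end with $. This is only a sketch, assuming GNU sed (as found on most Linux systems); the sample text and the variable names (slurped, step1) are invented for the example:

```shell
# Hypothetical sample: two lines, a blank line, then a third line.
# -n suppresses normal output; "l" prints the pattern space unambiguously.
slurped=$(printf 'one\ntwo\n\nthree\n' | sed -n ':a;N;$!ba;l')
printf '%s\n' "$slurped"    # prints: one\ntwo\n\nthree$

# With the whole file in the pattern space, the blank line can be matched.
step1=$(printf 'one\ntwo\n\nthree\n' | sed ':a;N;$!ba;s/\n\n/NEWPARA/g')
printf '%s\n' "$step1"
```

The second command prints “one” on the first line and “twoNEWPARAthree” on the second: the paragraph break has been marked, while the ordinary line break is still in place.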
The final version
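I appended three more substitutions to the sed script:
s/\n/ /g – replace a single newline with a space
s/NEWPARA/\n/g – replace NEWPARA with a newline
s/  */ /g – replace one or more spaces with a single space (the pattern is two spaces followed by *, because a bare + is not a repetition operator in sed’s default basic regular expressions)
Split over multiple lines for readability, the final command was:
$ sed \
':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/  */ /g' \
<myfile.txt >output.txt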
The resulting file can be read into a word processor and will have the correct paragraph breaks and no sequences of multiple spaces.
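With a whole batch of files to convert, the same script can be wrapped in a shell loop. The following is a minimal sketch rather than the exact commands I ran: the directory names (chapters, joined), the file names and the sample text are all hypothetical, and the space-collapsing pattern is written as two spaces followed by * so that it works with sed’s default regular expressions. It assumes GNU sed.

```shell
# Hypothetical layout: source files in "chapters", cleaned copies in "joined".
# Two small sample files are created so the sketch can be run as-is.
workdir=$(mktemp -d)
mkdir "$workdir/chapters" "$workdir/joined"
printf 'Line one of\nparagraph  one.\n\nParagraph two.\n' >"$workdir/chapters/ch01.txt"
printf 'Another\nfile.\n' >"$workdir/chapters/ch02.txt"

# The sed script is kept in a variable so it is written only once.
script=':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/  */ /g'

for f in "$workdir"/chapters/*.txt; do
    # Write each cleaned-up file under the same name in "joined".
    sed "$script" <"$f" >"$workdir/joined/$(basename "$f")"
done

cat "$workdir/joined/ch01.txt"
```

Writing the results to a separate directory leaves the original files untouched, so the script can be adjusted and re-run if an import does not look right.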