Linux Tips: import a text file to a word processor

From time to time, I’ve wanted to import a simple text file into LibreOffice Writer or other word processing programs. The challenge when doing so is that most text files have embedded newlines in them, and word processing programs interpret each newline as a paragraph marker, so every line of the file ends up as its own paragraph.

The usual technique is to either fix up the import by hand, which is OK for small files, or to go through a tedious sequence of edits to remove all newlines. This looks something like:

1. search for two consecutive newlines and replace them with the word NEWPARA
2. search for and remove all remaining newlines
3. search for NEWPARA and replace it with a newline

That almost works, but you’ll quickly realise that step 2 should be “search for all remaining newlines and replace with a space”. I also often find that the resulting file has some instances of two consecutive spaces.

In real life

Right now, I’m working on a book and the challenge I faced recently was importing 34 text files into one word processing document. The prospect of manually fixing up all 34 files was daunting. This is Linux: there must be an easier way!

I looked at sed, the stream editor. That’s great for making the same changes throughout a file or series of files, which is exactly what I want.

The first challenge is that, by default, sed operates on the input file one line at a time. Changing ‘dog’ to ‘cat’ would be easy:

$ sed --in-place -e 's/dog/cat/g' myfile.txt

But the first replacement I want to do, changing two consecutive newlines to NEWPARA, by definition requires sed to operate on more than one line at a time.
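The limitation is easy to see on a small sample. This is a minimal sketch, assuming GNU sed and a made-up two-paragraph input: in line-at-a-time mode each line reaches the script without its trailing newline, so the pattern \n\n never matches and the text comes out unchanged.

```shell
# Hypothetical sample: two one-line paragraphs separated by a blank line.
sample=$(printf 'first\n\nsecond')

# Line-at-a-time mode: sed strips the newline from each line before the
# script runs, so \n\n can never match and nothing is replaced.
unchanged=$(printf '%s\n' "$sample" | sed 's/\n\n/NEWPARA/g')

printf '%s\n' "$unchanged"
```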

The solution is to have sed read the whole file into memory before making any changes. This command will carry out the first step of our process, that of replacing two consecutive newlines with the word NEWPARA:

$ sed ':a;N;$!ba;s/\n\n/NEWPARA/g' <myfile.txt >output.txt

How it works

:a – create a label called ‘a’
; – the semicolon is the sed multi-statement separator
N – read the next line of input and append it to the pattern space
; – separator
$!ba – branch to label a unless we are at the last line of the file
; – separator
s/\n\n/NEWPARA/g – replace all instances of two consecutive newline characters with NEWPARA
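To watch the loop at work, here is a small sketch that assumes GNU sed and a made-up four-line sample. Once every line has been appended to the pattern space, the blank line shows up as two consecutive newlines and the substitution can match it.

```shell
# Hypothetical sample: a two-line paragraph, a blank line, then a second paragraph.
joined=$(printf 'line one\nline two\n\nnext para\n' |
  sed ':a;N;$!ba;s/\n\n/NEWPARA/g')

printf '%s\n' "$joined"
# line one
# line twoNEWPARAnext para
```

The single newline inside the first paragraph is still there at this stage; only the paragraph break has been marked.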

The final version

I appended three more substitutions to the sed script:

s/\n/ /g – replace each remaining newline with a space
s/NEWPARA/\n/g – replace NEWPARA with a newline
s/ +/ /g – replace one or more spaces with a single space (the + needs sed’s -E option, which enables extended regular expressions; without it, + is treated as a literal plus sign)

Split over multiple lines for readability, the final command was:

$ sed -E \
':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/ +/ /g' \
<myfile.txt >output.txt

The resulting file can be read into a word processor and will have the correct paragraph breaks and no sequences of multiple spaces.
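For a batch of files, the same script can be wrapped in a small shell function. This is only a sketch: it assumes GNU sed, the names reflow and reflow_all and the chapters/ path are hypothetical, and -E is used so that + means “one or more”.

```shell
# Reflow one file: each blank-line-separated paragraph becomes a single line.
# Assumes GNU sed; "reflow" is a hypothetical helper name.
reflow() {
  sed -E ':a;N;$!ba;s/\n\n/NEWPARA/g;s/\n/ /g;s/NEWPARA/\n/g;s/ +/ /g' "$1"
}

# Reflow every file named on the command line, in order, to standard output.
reflow_all() {
  for f in "$@"; do
    reflow "$f"
  done
}

# Example use (hypothetical paths):
#   reflow_all chapters/*.txt > book.txt
```

Each file is processed separately, so the last paragraph of one file and the first paragraph of the next stay as separate paragraphs in the combined output.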

Was this techtip useful?

Let us know in the comments below.
