Hello people,
Alright?
In this article, I will show you how to convert an RTF string to text (Remove RTF tags) using CLR (C #) or Powershell, which was a necessity I had last week, where a system would write the information to a table and the data would be Rich Text Format (RTF). I searched the internet a lot to find solutions that would help me convert RTF to text straight from the database and I didn't find many alternatives that I liked and that motivated me to write this post for you.
If you do not know the CLR or want to know more about it, understand how it works, its limitations, advantages, disadvantages and how to create and publish a new CLR project, read the post Introduction to SQL Common Language Runtime (CLR) in SQL Server.
The trivial solution with C #
If you do this search on Google, probably the most common solution you'll find is this:
1 2 3 4 |
System.Windows.Forms.RichTextBox rtBox = new System.Windows.Forms.RichTextBox(); string rtfText = "{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl {\f0\fswiss Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss Segoe UI;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green0\blue0;} \uc1\pard\plain\deftab360 \f0\fs24 {\*\htmltag0 \par } {\*\htmltag48 }{\*\htmltag64}\htmlrtf {\htmlrtf0 {\*\htmltag148 }\htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 {\*\htmltag4 \par }\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf}\htmlrtf0 {\*\htmltag96 }\htmlrtf {\ltrpar\ltrch\ql \htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 Dirceu Resende\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf\par}\htmlrtf0 {\*\htmltag104 }\htmlrtf }\htmlrtf0 {\*\htmltag248 } {\*\htmltag58 }}"; rtBox.Rtf = rtfText; string plainText = rtBox.Text; |
Which is a really simple solution and it really works. However, I would like to use this solution for the database and so using the System.Windows.Forms library is not a very interesting idea, since this library is not loaded by default by SQL Server and so I would have to Import its assembly and its dependencies (which are not few) manually into the database, as shown below:
This would increase the complexity of your CLR (as well as publications) and can have negative effects on the stability of your instance, so I do not recommend this practice.
The C # Solution Using Helper RichTextStripper
After much searching the internet for similar solutions, I found a class called RichTextStripper, written in C # and developed by Chris Benard, which does just that. He found Python code that does this work and converted the code to C #.
Function source code in CLR
This is the function that will be generated in the database and you will use it to remove RTF formatting from strings and columns in SQL Server after your CLR is published and using the RichTextStripper class.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 |
using System.Collections.Generic; using System.Text.RegularExpressions; public partial class UserDefinedFunctions { [Microsoft.SqlServer.Server.SqlFunction] public static string fncRemove_RTF_String(string Ds_String_Rtf) { return RichTextStripper.StripRichTextFormat(Ds_String_Rtf).Trim(); } } /// <summary> /// Rich Text Stripper /// </summary> /// <remarks> /// Translated from Python located at: /// http://stackoverflow.com/a/188877/448 /// https://chrisbenard.net/2014/08/20/extract-text-from-rtf-in-c-net/ /// </remarks> public static class RichTextStripper { private class StackEntry { public int NumberOfCharactersToSkip { get; set; } public bool Ignorable { get; set; } public StackEntry(int numberOfCharactersToSkip, bool ignorable) { NumberOfCharactersToSkip = numberOfCharactersToSkip; Ignorable = ignorable; } } private static readonly Regex _rtfRegex = new Regex(@"\\([a-z]{1,32})(-?\d{1,10})?[ ]?|\\'([0-9a-f]{2})|\\([^a-z])|([{}])|[\r\n]+|(.)", RegexOptions.Singleline | RegexOptions.IgnoreCase); private static readonly List<string> destinations = new List<string> { "aftncn","aftnsep","aftnsepc","annotation","atnauthor","atndate","atnicn","atnid", "atnparent","atnref","atntime","atrfend","atrfstart","author","background", "bkmkend","bkmkstart","blipuid","buptim","category","colorschememapping", "colortbl","comment","company","creatim","datafield","datastore","defchp","defpap", "do","doccomm","docvar","dptxbxtext","ebcend","ebcstart","factoidname","falt", "fchars","ffdeftext","ffentrymcr","ffexitmcr","ffformat","ffhelptext","ffl", "ffname","ffstattext","field","file","filetbl","fldinst","fldrslt","fldtype", "fname","fontemb","fontfile","fonttbl","footer","footerf","footerl","footerr", "footnote","formfield","ftncn","ftnsep","ftnsepc","g","generator","gridtbl", "header","headerf","headerl","headerr","hl","hlfr","hlinkbase","hlloc","hlsrc", "hsv","htmltag","info","keycode","keywords","latentstyles","lchars","levelnumbers", "leveltext","lfolevel","linkval","list","listlevel","listname","listoverride", "listoverridetable","listpicture","liststylename","listtable","listtext", "lsdlockedexcept","macc","maccPr","mailmerge","maln","malnScr","manager","margPr", "mbar","mbarPr","mbaseJc","mbegChr","mborderBox","mborderBoxPr","mbox","mboxPr", "mchr","mcount","mctrlPr","md","mdeg","mdegHide","mden","mdiff","mdPr","me", "mendChr","meqArr","meqArrPr","mf","mfName","mfPr","mfunc","mfuncPr","mgroupChr", "mgroupChrPr","mgrow","mhideBot","mhideLeft","mhideRight","mhideTop","mhtmltag", "mlim","mlimloc","mlimlow","mlimlowPr","mlimupp","mlimuppPr","mm","mmaddfieldname", "mmath","mmathPict","mmathPr","mmaxdist","mmc","mmcJc","mmconnectstr", "mmconnectstrdata","mmcPr","mmcs","mmdatasource","mmheadersource","mmmailsubject", "mmodso","mmodsofilter","mmodsofldmpdata","mmodsomappedname","mmodsoname", "mmodsorecipdata","mmodsosort","mmodsosrc","mmodsotable","mmodsoudl", "mmodsoudldata","mmodsouniquetag","mmPr","mmquery","mmr","mnary","mnaryPr", "mnoBreak","mnum","mobjDist","moMath","moMathPara","moMathParaPr","mopEmu", "mphant","mphantPr","mplcHide","mpos","mr","mrad","mradPr","mrPr","msepChr", "mshow","mshp","msPre","msPrePr","msSub","msSubPr","msSubSup","msSubSupPr","msSup", "msSupPr","mstrikeBLTR","mstrikeH","mstrikeTLBR","mstrikeV","msub","msubHide", "msup","msupHide","mtransp","mtype","mvertJc","mvfmf","mvfml","mvtof","mvtol", "mzeroAsc","mzeroDesc","mzeroWid","nesttableprops","nextfile","nonesttables", "objalias","objclass","objdata","object","objname","objsect","objtime","oldcprops", "oldpprops","oldsprops","oldtprops","oleclsid","operator","panose","password", "passwordhash","pgp","pgptbl","picprop","pict","pn","pnseclvl","pntext","pntxta", "pntxtb","printim","private","propname","protend","protstart","protusertbl","pxe", "result","revtbl","revtim","rsidtbl","rxe","shp","shpgrp","shpinst", "shppict","shprslt","shptxt","sn","sp","staticval","stylesheet","subject","sv", "svb","tc","template","themedata","title","txe","ud","upr","userprops", "wgrffmtfilter","windowcaption","writereservation","writereservhash","xe","xform", "xmlattrname","xmlattrvalue","xmlclose","xmlname","xmlnstbl", "xmlopen" }; private static readonly Dictionary<string, string> specialCharacters = new Dictionary<string, string> { { "par", "\n" }, { "sect", "\n\n" }, { "page", "\n\n" }, { "line", "\n" }, { "tab", "\t" }, { "emdash", "\u2014" }, { "endash", "\u2013" }, { "emspace", "\u2003" }, { "enspace", "\u2002" }, { "qmspace", "\u2005" }, { "bullet", "\u2022" }, { "lquote", "\u2018" }, { "rquote", "\u2019" }, { "ldblquote", "\u201C" }, { "rdblquote", "\u201D" }, }; /// <summary> /// Strip RTF Tags from RTF Text /// </summary> /// <param name="inputRtf">RTF formatted text</param> /// <returns>Plain text from RTF</returns> public static string StripRichTextFormat(string inputRtf) { if (inputRtf == null) { return null; } string returnString; var stack = new Stack<StackEntry>(); var ignorable = false; // Whether this group (and all inside it) are "ignorable". var ucskip = 1; // Number of ASCII characters to skip after a unicode character. var curskip = 0; // Number of ASCII characters left to skip var outList = new List<string>(); // Output buffer. var matches = _rtfRegex.Matches(inputRtf); if (matches.Count > 0) { foreach (Match match in matches) { var word = match.Groups[1].Value; var arg = match.Groups[2].Value; var hex = match.Groups[3].Value; var character = match.Groups[4].Value; var brace = match.Groups[5].Value; var tchar = match.Groups[6].Value; if (!string.IsNullOrEmpty(brace)) { curskip = 0; if (brace == "{") { // Push state stack.Push(new StackEntry(ucskip, ignorable)); } else if (brace == "}") { // Pop state var entry = stack.Pop(); ucskip = entry.NumberOfCharactersToSkip; ignorable = entry.Ignorable; } } else if (!string.IsNullOrEmpty(character)) // \x (not a letter) { curskip = 0; if (character == "~") { if (!ignorable) { outList.Add("\xA0"); } } else if ("{}\\".Contains(character)) { if (!ignorable) { outList.Add(character); } } else if (character == "*") { ignorable = true; } } else if (!string.IsNullOrEmpty(word)) // \foo { curskip = 0; if (destinations.Contains(word)) { ignorable = true; } else if (ignorable) { } else if (specialCharacters.ContainsKey(word)) { outList.Add(specialCharacters[word]); } else if (word == "uc") { ucskip = int.Parse(arg); } else if (word == "u") { var c = int.Parse(arg); if (c < 0) { c += 0x10000; } outList.Add(char.ConvertFromUtf32(c)); curskip = ucskip; } } else if (!string.IsNullOrEmpty(hex)) // \'xx { if (curskip > 0) { curskip -= 1; } else if (!ignorable) { var c = int.Parse(hex, System.Globalization.NumberStyles.HexNumber); outList.Add(char.ConvertFromUtf32(c)); } } else if (!string.IsNullOrEmpty(tchar)) { if (curskip > 0) { curskip -= 1; } else if (!ignorable) { outList.Add(tchar); } } } } else { // Didn't match the regex returnString = inputRtf; } returnString = string.Join(string.Empty, outList.ToArray()); return returnString; } } |
Function usage demo
As you can see below, the result of the function is the same as when you open the document (or string) in Microsoft Word.
The solution with Powershell
If you don't want to use CLR (C #) to accomplish this task, don't worry. You can also do this using Powershell (Posh) and you can integrate it with SQL Server in various ways by using xp_cmdshell and exporting the result to text or capturing the powershell return.
The Powershell script is very simple. Remember that simple C # script I posted at the beginning of the post that I didn't want to implement in CLR because of Windows.Forms dependencies? So .. In powershell this is no problem .. 🙂
Powershell Source Code:
1 2 3 4 5 6 7 8 |
[Reflection.Assembly]::LoadWithPartialName("System.Windows.Forms") | Out-Null $Rtb = New-Object -TypeName System.Windows.Forms.RichTextBox $stringRTF = "{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl {\f0\fswiss Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss Segoe UI;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green0\blue0;} \uc1\pard\plain\deftab360 \f0\fs24 {\*\htmltag0 \par } {\*\htmltag48 }{\*\htmltag64}\htmlrtf {\htmlrtf0 {\*\htmltag148 }\htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 {\*\htmltag4 \par }\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf}\htmlrtf0 {\*\htmltag96 }\htmlrtf {\ltrpar\ltrch\ql \htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 Dirceu Resende\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf\par}\htmlrtf0 {\*\htmltag104 }\htmlrtf }\htmlrtf0 {\*\htmltag248 } {\*\htmltag58 }}" $Rtb.Rtf = $stringRTF $Retorno = $Rtb.Text Write-Host($Retorno) |
Powershell Execution Result
Example 1:
With this you can use xp_cmdshell to run your Powershell as needed:

Example Using PowerShell -File and Entering the Path of a File I Previously Created with the Posh Commands Demonstrated Above
2 Example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
DECLARE @StringRTF VARCHAR(MAX), @Posh VARCHAR(8000) SET @StringRTF = '{\rtf1\ansi\ansicpg1252\fromhtml1 \fbidis \deff0{\fonttbl {\f0\fswiss Arial;} {\f1\fmodern Courier New;} {\f2\fnil\fcharset2 Symbol;} {\f3\fmodern\fcharset0 Courier New;} {\f4\fswiss Segoe UI;}} {\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green0\blue0;} \uc1\pard\plain\deftab360 \f0\fs24 {\*\htmltag0 \par } {\*\htmltag48 }{\*\htmltag64}\htmlrtf {\htmlrtf0 {\*\htmltag148 }\htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 {\*\htmltag4 \par }\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf}\htmlrtf0 {\*\htmltag96 }\htmlrtf {\ltrpar\ltrch\ql \htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf {\cf2 \fs20 \f4 \htmlrtf0 Dirceu Resende\htmlrtf }\htmlrtf0 {\*\htmltag72}\htmlrtf\par}\htmlrtf0 {\*\htmltag104 }\htmlrtf }\htmlrtf0 {\*\htmltag248 } {\*\htmltag58 }}' SET @Posh = 'powershell.exe -ExecutionPolicy ByPass -Command "Add-Type -AssemblyName System.Windows.Forms; $Rtb = New-Object -TypeName System.Windows.Forms.RichTextBox; $stringRTF = ''' + @StringRTF + '''; $Rtb.Rtf = $stringRTF; $Retorno = $Rtb.Text; Write-Host($Retorno);"' IF (OBJECT_ID('tempdb..#Retorno') IS NOT NULL) DROP TABLE #Retorno CREATE TABLE #Retorno ( Texto VARCHAR(MAX) ) INSERT INTO #Retorno EXEC master.dbo.xp_cmdshell @Posh SELECT * FROM #Retorno |
Result:
That's it folks!
I hope you enjoyed the post and see you next time.
sql server clr converter convert remove remove rdf rich text format tags strings for plain text plain text C # csharp powershell posh ps
sql server clr converter convert remove remove rdf rich text format tags strings for plain text plain text C # csharp powershell posh ps
Good,
When trying the given example using RTF:
{\rtf1\ansi\ansicpg1252\deff0\deflang2070{\fonttbl{\f0\fmodern\fprq6\fcharset134 SimSun;}{\f1\fnil\fcharset0 MS Sans Serif;}} \viewkind4\uc1\pard\lang2052\f0\fs17\’b7\’f2\’c8\’cb\’b7\’eb\’b7\’f2\’c8\’cb6346\’b8\’f6\’ba\’ec\’b5\’c4\’b9\’f0\’bb\’a8\lang2070\f1 \par }
It always gives me a mistake, but if you take it and put it in CMD directly it gives me no problem, what could it be?
On my computer I could not force the conversion in powershell
The two lines below did not make this conversion ...
$ Rtb.Rtf = $ stringRTF
$ Return = $ Rtb.Text
PS C: \> echo $ Rtb
AllowDrop: False
AutoSize: False
AutoWordSelection: False
BackgroundImage:
BackgroundImageLayout: Tile
BulletIndent: 0
CanRedo: False
DetectUrls: True
EnableAutoDragDrop: False
ForeColor: Color [WindowText]
Font: [Font: Name = Microsoft
LanguageOption: 0
MaxLength: 2147483647
Multiline: True
RedoActionName:
RichTextShortcutsEnabled: True
RightMargin: 0
Rtf:
ScrollBars: Both
SelectionAlignment: Left
SelectionBullet: False
SelectionCharOffset: 0
SelectionColor: Color [Empty]
SelectionBackColor: Color [Empty]
SelectionFont:
SelectionHangingIndent: 0
SelectionIndent: 0
SelectionLength: 0
SelectionProtected: False
SelectedRtf:
SelectionRightIndent: 0
SelectionTabs: {}
SelectedText:
SelectionType: Empty
ShowSelectionMargin: False
Text:
TextLength: 0
UndoActionName:
...
'Anyway, I found the various ways to solve this case very cool, I even deal with several for / f “tokens = xyz” delims inside batch files to get maybe the result… it's good when you have simple solutions like it is!